

PSYCHOLOGICAL TESTING: AN INTRODUCTION

Second Edition

This book is an introductory text to the field of psychological testing, suitable primarily for undergraduate students in psychology, education, business, and related fields. It will also be of interest to graduate students who have not had prior exposure to psychological testing and to professionals, such as lawyers, who need a useful reference source. Psychological Testing is clearly written, well organized, comprehensive, and replete with illustrative materials. In addition to the basic topics, the text covers in detail topics that are often neglected by other texts, such as cross-cultural testing, the issue of faking tests, the impact of computers, and the use of tests to assess positive behaviors such as creativity.

George Domino is the former Director of Clinical Psychology and Professor of Psychology at the University of Arizona. He previously served as Director of the Counseling Center and Professor of Psychology at Fordham University.

Marla L. Domino has a BA in Psychology, an MA in Criminal Law, and a PhD in Clinical Psychology specializing in Psychology and Law. She also completed a postdoctoral fellowship in Clinical-Forensic Psychology at the University of Massachusetts Medical School, Law and Psychiatry Program. She is currently the Chief Psychologist in the South Carolina Department of Mental Health’s Forensic Evaluation Service and an assistant professor in the Department of Neuropsychiatry and Behavioral Sciences at the University of South Carolina. She was recently named the South Carolina Department of Mental Health’s Outstanding Employee of the Year in Forensics (2004).


SECOND EDITION

Psychological Testing
An Introduction

George Domino
University of Arizona

Marla L. Domino
Department of Mental Health, State of South Carolina

CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo

Cambridge University Press
The Edinburgh Building, Cambridge CB2 2RU, UK
Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9780521861816

© Cambridge University Press 2006

This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published in print format 2006

ISBN-13 978-0-511-22012-8 eBook (EBL)
ISBN-10 0-511-22012-X eBook (EBL)
ISBN-13 978-0-521-86181-6 hardback
ISBN-10 0-521-86181-0 hardback

Cambridge University Press has no responsibility for the persistence or accuracy of urls
for external or third-party internet websites referred to in this publication, and does not
guarantee that any content on such websites is, or will remain, accurate or appropriate.

Contents

Preface ix
Acknowledgments xi

PART ONE. BASIC ISSUES

1 The Nature of Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   Aim, 1 • Introduction, 1 • Categories of Tests, 5 • Ethical Standards, 9 • Information about Tests, 11 • Summary, 12 • Suggested Readings, 14 • Discussion Questions, 14

2 Test Construction, Administration, and Interpretation . . . . . . . . . . 15
   Aim, 15 • Constructing a Test, 15 • Test Items, 18 • Philosophical Issues, 22 • Administering a Test, 25 • Interpreting Test Scores, 25 • Item Characteristics, 28 • Norms, 34 • Combining Test Scores, 38 • Summary, 40 • Suggested Readings, 41 • Discussion Questions, 41

3 Reliability and Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
   Aim, 42 • Introduction, 42 • Reliability, 42 • Types of Reliability, 43 • Validity, 52 • Aspects of Validity, 57 • Summary, 65 • Suggested Readings, 66 • Discussion Questions, 66

PART TWO. DIMENSIONS OF TESTING

4 Personality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
   Aim, 67 • Introduction, 67 • Some Basic Issues, 68 • Types of Personality Tests, 70 • Examples of Specific Tests, 72 • The Big Five, 88 • Summary, 91 • Suggested Readings, 91 • Discussion Questions, 91

5 Cognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
   Aim, 92 • Introduction, 92 • Theories of Intelligence, 94 • Other Aspects, 97 • The Binet Tests, 100 • The Wechsler Tests, 105 • Other Tests, 116 • Summary, 125 • Suggested Readings, 126 • Discussion Questions, 126

6 Attitudes, Values, and Interests . . . . . . . . . . . . . . . . . . . . . . . . . 127
   Aim, 127 • Attitudes, 127 • Values, 141 • Interests, 148 • Summary, 160 • Suggested Readings, 160 • Discussion Questions, 160


7 Psychopathology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
   Aim, 161 • Introduction, 161 • Measures, 163 • The Minnesota Multiphasic Personality Inventory (MMPI) and MMPI-2, 170 • The Millon Clinical Multiaxial Inventory (MCMI), 179 • Other Measures, 185 • Summary, 196 • Suggested Readings, 196 • Discussion Questions, 196

8 Normal Positive Functioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
   Aim, 197 • Self-Concept, 197 • Locus of Control, 202 • Sexuality, 204 • Creativity, 205 • Imagery, 213 • Competitiveness, 215 • Hope, 216 • Hassles, 218 • Loneliness, 218 • Death Anxiety, 219 • Summary, 220 • Suggested Readings, 220 • Discussion Questions, 221

PART THREE. APPLICATIONS OF TESTING


9 Special Children . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
   Aim, 223 • Some Issues Regarding Testing, 223 • Categories of Special Children, 234 • Some General Issues About Tests, 246 • Summary, 255 • Suggested Readings, 255 • Discussion Questions, 256

10 Older Persons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
   Aim, 257 • Some Overall Issues, 257 • Attitudes Toward the Elderly, 260 • Anxiety About Aging, 261 • Life Satisfaction, 261 • Marital Satisfaction, 263 • Morale, 264 • Coping or Adaptation, 265 • Death and Dying, 265 • Neuropsychological Assessment, 266 • Depression, 269 • Summary, 270 • Suggested Readings, 270 • Discussion Questions, 271

11 Testing in a Cross-Cultural Context . . . . . . . . . . . . . . . . . . . . . . 272
   Aim, 272 • Introduction, 272 • Measurement Bias, 272 • Cross-Cultural Assessment, 282 • Measurement of Acculturation, 284 • Some Culture-Fair Tests and Findings, 287 • Standardized Tests, 293 • Summary, 295 • Suggested Readings, 295 • Discussion Questions, 296

12 Disability and Rehabilitation . . . . . . . . . . . . . . . . . . . . . . . . . . 297
   Aim, 297 • Some General Concerns, 297 • Modified Testing, 300 • Some General Results, 301 • Legal Issues, 304 • The Visually Impaired, 307 • Hearing Impaired, 312 • Physical-Motor Disabilities, 321 • Summary, 323 • Suggested Readings, 323 • Discussion Questions, 324

PART FOUR. THE SETTINGS

13 Testing in the Schools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
   Aim, 325 • Preschool Assessment, 325 • Assessment in the Primary Grades, 328 • High School, 331 • Admission into College, 334 • The Graduate Record Examination, 342 • Entrance into Professional Training, 348 • Tests for Licensure and Certification, 352 • Summary, 354 • Suggested Readings, 355 • Discussion Questions, 355

14 Occupational Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
   Aim, 356 • Some Basic Issues, 356 • Some Basic Findings, 356 • Ratings, 359 • The Role of Personality, 360 • Biographical Data (Biodata), 363 • Assessment Centers, 365 • Illustrative Industrial Concerns, 371 • Testing in the Military, 373 • Prediction of Police Performance, 376 • Examples of Specific Tests, 377 • Integrity Tests, 379 • Summary, 384 • Suggested Readings, 388 • Discussion Questions, 389

15 Clinical and Forensic Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 390
   Aim, 390 • Clinical Psychology: Neuropsychological Testing, 390 • Projective Techniques, 392 • Some Clinical Issues and Syndromes, 406 • Health Psychology, 409 • Forensic Psychology, 419 • Legal Standards, 422 • Legal Cases, 422 • Summary, 426 • Suggested Readings, 426 • Discussion Questions, 426

PART FIVE. CHALLENGES TO TESTING

16 The Issue of Faking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
   Aim, 427 • Some Basic Issues, 427 • Some Psychometric Issues, 432 • Techniques to Discourage Faking, 434 • Related Issues, 435 • The MMPI and Faking, 437 • The CPI and Faking, 443 • Social Desirability and Assessment Issues, 444 • Acquiescence, 448 • Other Issues, 449 • Test Anxiety, 456 • Testwiseness, 457 • Summary, 458 • Suggested Readings, 458 • Discussion Questions, 459

17 The Role of Computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
   Aim, 460 • Historical Perspective, 460 • Computer Scoring of Tests, 461 • Computer Administration of Tests, 462 • Computer-Based Test Interpretations (CBTI), 467 • Some Specific Tests, 471 • Adaptive Testing and Computers, 473 • Ethical Issues Involving Computer Use, 476 • Other Issues and Computer Use, 477 • A Look at Other Tests and Computer Use, 478 • The Future of Computerized Psychological Testing, 481 • Summary, 481 • Suggested Readings, 482 • Discussion Questions, 482

18 Testing Behavior and Environments . . . . . . . . . . . . . . . . . . . . . . 483
   Aim, 483 • Traditional Assessment, 483 • Behavioral Assessment, 484 • Traditional vs. Behavioral Assessment, 488 • Validity of Behavioral Assessment, 488 • Behavioral Checklists, 490 • Behavioral Questionnaires, 492 • Program Evaluation, 501 • Assessment of Environments, 502 • Assessment of Family Functioning, 506 • Broad-Based Instruments, 510 • Summary, 515 • Suggested Readings, 515 • Discussion Questions, 516

19 The History of Psychological Testing . . . . . . . . . . . . . . . . . . . . . . 517
   Aim, 517 • Introduction, 517 • The French Clinical Tradition, 518 • The German Nomothetic Approach, 519 • The British Idiographic Approach, 520 • The American Applied Orientation, 522 • Some Recent Developments, 530 • Summary, 533 • Suggested Readings, 533 • Discussion Questions, 533

Appendix: Table to Translate Difficulty Level of a Test Item into a z Score . . . 535
References 537
Test Index 623
Index of Acronyms 627
Subject Index 629
Preface

My first professional publication in 1963 was as a graduate student (with Harrison Gough) on a validational study of a culture-fair test. Since then, I have taught a course on psychological testing with fair regularity. At the same time, I have steadfastly refused to specialize and have had the opportunity to publish in several different areas, to work in management consulting, to be director of a counseling center and of a clinical psychology program, to establish an undergraduate honors program, and to be involved in a wide variety of projects with students in nursing, rehabilitation, education, social work, and other fields. In all of these activities, I have found psychological testing to be central and to be very challenging and exciting.

In this book, we have tried to convey the excitement associated with psychological testing and to teach basic principles through the use of concrete examples. When specific tests are mentioned, they are mentioned because they are used as an example to teach important basic principles, or in some instances, because they occupy a central/historical position. No attempt has been made to be exhaustive.

Much of what is contained in many testing textbooks is rather esoteric information, of use only to very few readers. For example, most textbooks include several formulas to compute interitem consistency. It has been our experience, however, that 99% of the students who take a course on testing will never have occasion to use such formulas, even if they enter a career in psychology or allied fields. The very few who might need to do such calculations will do them by computer or will know where to find the relevant formulas. It is the principle that is important, and that is what we have tried to emphasize.

Because of my varied experience in industry, in a counseling center, and other service-oriented settings, and also because as a clinically trained academic psychologist I have done a considerable amount of research, I have tried to cover both sides of the coin – the basic research-oriented issues and the application of tests in service-oriented settings. Thus Parts One and Two, the first eight chapters, serve as an introduction to basic concepts, issues, and approaches. Parts Three and Four, Chapters 9 through 15, have a much more applied focus. Finally, we have attempted to integrate both classical approaches and newer thinking about psychological testing.

The area of psychological testing is fairly well defined. I cannot imagine a textbook that does not discuss such topics as reliability, validity, and norms. Thus, what distinguishes one textbook from another is not so much its content but more a question of balance. For example, most textbooks continue to devote one or more chapters to projective techniques, even though their use and importance has decreased substantially. Projective techniques are important, not only from a historical perspective, but also for what they can teach us about basic issues in testing. In this text, they are discussed and illustrated, but as part of a chapter (see Chapter 15) within the broader context of testing in clinical settings.

Most textbooks also have several chapters on intelligence testing, often devoting considerable space to such topics as the heritability of intelligence, theories of trait organization, longitudinal studies of intelligence, and similar topics. Such topics are of course important and fascinating, but do they really belong in a textbook on psychological testing? If they do, then that means that some other topics more directly relevant to testing are omitted or given short shrift. In this textbook, we have chosen to focus on testing and to minimize the theoretical issues associated with intelligence, personality, etc., except where they may be needed to have a better understanding of testing approaches.

It is no surprise that computers have had (and continue to have) a major impact on psychological testing, and so an entire chapter of this book (Chapter 17) is devoted to this topic. There is also a vast body of literature and great student interest on the topic of faking, and here too an entire chapter (Chapter 16) has been devoted to this topic. Most textbooks begin with a historical chapter. We have chosen to place this chapter last, so the reader can better appreciate the historical background from a more knowledgeable point of view.

Finally, rather than writing a textbook about testing, we have attempted to write a textbook about testing the individual. We believe that most testing applications involve an attempt to use tests as a tool to better understand an individual, whether that person is a client in therapy, a college student seeking career or academic guidance, a business executive wishing to capitalize on strengths and improve on weaknesses, or a volunteer in a scientific experiment.
Acknowledgments

In my career as a psychologist, I have had the excellent fortune to be mentored, directly and indirectly, by three giants in the psychological testing field. The first is Harrison Gough, my mentor in graduate school at Berkeley, who showed me how useful and exciting psychological tests can be when applied to real-life problems. More importantly, Gough has continued to be not only a mentor but also a genuine model to be emulated both as a psychologist and as a human being. Much of my thinking and approach to testing, as well as my major interest in students at all levels, is a direct reflection of Gough’s influence.

The second was Anne Anastasi, a treasured colleague at Fordham University, a generous friend, and the best chairperson I have ever worked with. Her textbook has been truly a model of scholarship and concise writing, the product of an extremely keen mind who advanced the field of psychological testing in many ways.

The third person was Lee J. Cronbach of Stanford University. My first undergraduate exposure to testing was through his textbook. In 1975, Cronbach wrote what is now a classic paper titled, “Beyond the two disciplines of scientific psychology” (American Psychologist, 1975, vol. 30, pp. 116–127), in which he argued that experimental psychology and the study of individual differences should be integrated. In that paper, Cronbach was kind enough to cite at some length two of my studies on college success as examples of this integration. Subsequently I was able to invite him to give a colloquium at the University of Arizona. My contacts with him were regrettably brief, but his writings greatly influenced my own thinking.

On a personal note, I thank Valerie, my wife of 40 years, for her love and support, and for being the best companion one could hope for in this voyage we call life. Our three children have been an enormous source of love and pride: Brian, currently a professor of philosophy at Miami University of Ohio; Marisa, a professor of health economics at the University of North Carolina, Chapel Hill; and Marla, chief forensic psychologist in the Department of Mental Health of South Carolina, and co-author of this edition. Zeno and Paolo, our two grandchildren, are unbelievably smart, handsome, and adorable and make grandparenting a joy. I have also been truly blessed with exceptional friends whose love and caring have enriched my life enormously.

George Domino
Tucson, AZ

An abundance of gratitude to my father for giving me the opportunity to collaborate with one of the greatest psychologists ever known. And an immeasurable amount of love and respect to my heroes – my Dad and Mom. I would also like to thank my mentor and friend, Stan Brodsky, whose professional accomplishments are only surpassed by his warmth, kindness, and generous soul.

Marla Domino
Columbia, SC
PART ONE: BASIC ISSUES

1 The Nature of Tests

AIM In this chapter we cover four basic issues. First, we focus on what a test is, not just a formal definition, but on ways of thinking about tests. Second, we try to develop a “taxonomy” of tests; that is, we look at various ways in which tests can be categorized. Third, we look at the ethical aspects of psychological testing. Finally, we explore how we can obtain information about a specific test.

INTRODUCTION

Most likely you would have no difficulty identifying a psychological test, even if you met one in a dark alley. So the intent here is not to give you one more definition to memorize and repeat but rather to spark your thinking.

What is a test? Anastasi (1988), one of the best known psychologists in the field of testing, defined a test as an “objective” and “standardized” measure of a sample of behavior. This is an excellent definition that focuses our attention on three elements: (1) objectivity: that is, at least theoretically, most aspects of a test, such as how the test is scored and how the score is interpreted, are not a function of the subjective decision of a particular examiner but are based on objective criteria; (2) standardization: that is, no matter who administers, scores, and interprets the test, there is uniformity of procedure; and (3) a sample of behavior: a test is not a psychological X-ray, nor does it necessarily reveal hidden conflicts and forbidden wishes; it is a sample of a person’s behavior, hopefully a representative sample from which we can draw some inferences and hypotheses.

There are three other ways to consider psychological tests that we find useful and we hope you will also. One way is to consider the administration of a test as an experiment. In the classical type of experiment, the experimenter studies a phenomenon and observes the results, while at the same time keeping in check all extraneous variables so that the results can be ascribed to a particular antecedent cause. In psychological testing, however, it is usually not possible to control all the extraneous variables, but the metaphor here is a useful one that forces us to focus on the standardized procedures, on the elimination of conflicting causes, on experimental control, and on the generation of hypotheses that can be further investigated. So if I administer a test of achievement to little Sandra, I want to make sure that her score reflects what she has achieved, rather than her ability to follow instructions, her degree of hunger before lunch, her uneasiness at being tested, or some other influence.

A second way to consider a test is to think of a test as an interview. When you are administered an examination in your class, you are essentially being interviewed by the instructor to determine how well you know the material. We discuss interviews in Chapter 18, but for now consider the following: in most situations we need to “talk” to each other. If I am the instructor, I need to know how much you have learned. If I am hiring an architect to design a house or a contractor to build one, I need to evaluate their competency, and so on. Thus “interviews” are necessary, but a test offers many advantages over the standard interview. With a test I can “interview” 50 or 5,000 persons at one sitting. With a test I can be much more objective in my evaluation because, for example, multiple-choice answer sheets do not discriminate on the basis of gender, ethnicity, or religion.

A third way to consider tests is as tools. Many fields of endeavor have specific tools – for example, physicians have scalpels and X-rays, chemists have Bunsen burners and retorts. Just because someone can wield a scalpel or light up a Bunsen burner does not make him or her an “expert” in that field. The best use of a tool is in the hands of a trained professional when it is simply an aid to achieve a particular goal. Tests, however, are not just psychological tools; they also have political and social repercussions. For example, the well-publicized decline in SAT scores (Wirtz & Howe, 1977) has been used as an indicator of the terrible shape our educational system is in (National Commission, 1983).

A test by any other name. . . . In this book, we use the term psychological test (or more briefly, test) to cover those measuring devices, techniques, procedures, examinations, etc., that in some way assess variables relevant to psychological functioning. Some of these variables, such as intelligence, introversion-extraversion, and self-esteem, are clearly “psychological” in nature. Others, such as heart rate or the amount of palmar perspiration (the galvanic skin response), are more physiological but are related to psychological functioning. Still other variables, such as socialization, delinquency, or leadership, may be somewhat more “sociological” in nature, but are of substantial interest to most social and behavioral scientists. Other variables, such as academic achievement, might be more relevant to educators or professionals working in educational settings. The point here is that we use the term psychological in a rather broad sense.

Psychological tests can take a variety of forms. Some are true-false inventories, others are rating scales, some are actual tests, whereas others are questionnaires. Some tests consist of materials such as inkblots or pictures to which the subject responds verbally; still others consist of items such as blocks or pieces of a puzzle that the subject manipulates. A large number of tests are simply a set of printed items requiring some type of written response.

Testing vs. assessment. Psychological assessment is basically a judgmental process whereby a broad range of information, often including the results of psychological tests, is integrated into a meaningful understanding of a particular person. If that person is a client or patient in a psychotherapeutic setting, we call the process clinical assessment. Psychological testing is thus a narrower concept, referring to the psychometric aspects of a test (the technical information about the test), the actual administration and scoring of the test, and the interpretation made of the scores. We could of course assess a client simply by administering a test or battery (group) of tests. Usually the assessing psychologist also interviews the client, obtains background information, and where appropriate and feasible, information from others about the client [see Korchin, 1976, for an excellent discussion of clinical assessment, and G. J. Meyer, Finn, Eyde, et al. (2001) for a brief overview of assessment].

Purposes of tests. Tests are used for a wide variety of purposes that can be subsumed under more general categories. Many authors identify four such categories, typically labeled classification, self-understanding, program evaluation, and scientific inquiry.

Classification involves a decision that a particular person belongs in a certain category. For example, based on test results we may assign a diagnosis to a patient, place a student in the introductory Spanish course rather than the intermediate or advanced course, or certify that a person has met the minimal qualifications to practice medicine.

Self-understanding involves using test information as a source of information about oneself. Such information may already be available to the individual, but not in a formal way. Marlene, for example, is applying to graduate studies in electrical engineering; her high GRE scores confirm what she already knows, that she has the potential abilities required for graduate work.

Program evaluation involves the use of tests to assess the effectiveness of a particular program or course of action. You have probably seen in the newspaper tables indicating the average achievement test scores for various schools in your geographical area, with the scores often taken, perhaps incorrectly, as evidence of the competency level of a particular school. Program evaluation may involve the assessment of the campus climate at a particular college, or the value of a drug abuse program offered by a mental health clinic, or the effectiveness of a new medication.

Tests are also used in scientific inquiry. If you glance through most professional journals in the social and behavioral sciences, you will find that a large majority of studies use psychological tests to operationally define relevant variables and to translate hypotheses into numerical statements that can be assessed statistically. Some argue that the development of a field of science is, in large part, a function of the available measurement techniques (Cone & Foster, 1991; Meehl, 1978).

Tests as experimental procedure. If we accept the analogy that administering a test is very much like an experiment, then we need to make sure that the experimental procedure is followed carefully and that extraneous variables are not allowed to influence the results. This means, for example, that instructions and time limits need to be adhered to strictly. The greater the control that can be exercised over all aspects of a test situation, the lesser the influence of extraneous variables. Thus the scoring of a multiple-choice exam is less influenced by such variables as clarity of handwriting than the scoring of an essay exam; a true-false personality inventory with simple instructions is probably less influenced than an intelligence test with detailed instructions.

Masling (1960) reviewed a variety of studies of variables that can influence a testing situation, in this case “projective” testing (see Chapter 15); Sattler and Theye (1967) did the same for intelligence tests. We can identify, as Masling (1960) did, four categories of such variables:

1. The method of administration. Standard administration can be altered by disregarding or changing instructions, by explicitly or implicitly giving the subject a set to answer in a certain way, or by not following standard procedures. For example, Coffin (1941) had subjects read fictitious magazine articles indicating what were more socially acceptable responses to the Rorschach Inkblot test. Subsequently they were tested with the Rorschach, and the responses clearly showed a suggestive influence because of the prior readings. Ironson and Davis (1979) administered a test of creativity three times, with instructions to “fake creative,” “fake uncreative,” or “be honest”; the obtained scores reflected the influence of the instructions. On the other hand, Sattler and Theye (1967) indicated that of twelve studies reviewed that departed from standard administrative procedures, only five reported significant differences between standard and nonstandard administration.

2. Situational variables. These include a variety of aspects that presumably can alter the test situation significantly, such as a subject feeling frustrated, discouraged, or hungry, being under the influence of drugs, and so on. Some of these variables can have significant effects on test scores, but the effects are not necessarily the same for all subjects. For example, Sattler and Theye (1967) report that discouragement affects the performance of children but not of college students on some intelligence tests.

3. Experimenter variables. The testing situation is a social situation, and even when the test is administered by computer, there is clearly an experimenter, a person in charge. That person may exhibit characteristics (such as age, gender, and skin color) that differ from those of the subject. The person may appear more or less sympathetic, warm or cold, more or less authoritarian, aloof, more adept at establishing rapport, etc. These aspects may or may not affect the subject’s test performance; the results of the available experimental evidence are quite complex and not easily summarized. We can agree with Sattler and Theye (1967), who concluded that the experimenter-subject relationship is important and that (perhaps) less qualified experimenters do not obtain appreciably different results than more qualified experimenters. Whether the race, ethnicity, physical characteristics, etc., of the experimenter significantly affect the testing situation seems to depend on a lot of other variables; in general, these do not seem to be as powerful an influence as many might think.

4. Subject variables. Do aspects of the subject, such as level of anxiety, physical attractiveness, etc., affect the testing situation? Masling (1960) used attractive female accomplices who, as test subjects, acted “warm” or “cold” toward the examiners (graduate students). The test results were interpreted by the graduate students more favorably when the subject acted warm than when she acted cold.

In general, what can we conclude? Aside from the fact that most studies in this area seem to have major design flaws and that many specific variables have not been explored consistently, Masling (1960) concluded that there is strong evidence of situational and interpersonal influences in projective testing, while Sattler and Theye (1967) concluded that:

1. Departures from standard procedures are more likely to affect “specialized” groups, such as children, schizophrenics, and juvenile delinquents, than “normal” groups such as college students;

2. Children seem to be more susceptible to situational factors, especially discouragement, than are college-aged adults;

3. Rapport seems to be a crucial variable, while degree of experience of the examiner is not;

4. Racial differences, specifically a white examiner and a black subject, may be important, but the evidence is not definitive.

Tests in decision making. In the real world, decisions need to be made. To allow every person who applies to medical school to be admitted would not only create huge logistical problems, but would result in chaos and in a situation that would be unfair to the candidates themselves, some of whom would not have the intellectual and other competencies required to be physicians; to the medical school faculty, whose teaching efforts would be diluted by the presence of unqualified candidates; and eventually to the public, who might be faced with incompetent physicians.

Given that decisions need to be made, we must ask what role psychological tests can play in such decision making. Most psychologists agree that major decisions should not be based on the results of a single test administration – that whether or not State University admits Sandra should not be based solely on her SAT scores. In fact, despite a stereotype to the contrary, it is rare for such decisions to be based solely on test data. Yet in many situations, test data represent the only source of objective data that is standard for all candidates; other sources of data such as interviews, grades, and letters of recommendation are all “variable” – grades from different schools or different instructors are not comparable, nor are letters written by different evaluators.

Finally, as scientists, we should ask what the empirical evidence is for the accuracy of predicting future behavior. That is, if we are admitting college students to a particular institution, which sources of data, singly or in combination, such as interviewers’ opinions, test scores, high school GPA, etc., would be most accurate in making relevant predictions, such as, “Let’s admit Marlene because she will do quite well academically”? We will return to this issue, but for now let me indicate a general psychological principle, that past behavior is the best predictor of future behavior, and a corollary, that the results of psychological tests can provide very useful information on which to make more accurate future predictions.

Relation of test content to predicted behavior. Rebecca is enrolled in an introductory Spanish course and is given a Spanish vocabulary test by the instructor. Is the instructor interested in whether Rebecca knows the meaning of the specific words on the test? Yes indeed, because the test is designed to assess Rebecca’s mastery of the vocabulary covered in class and in homework assignments. Consider now a test such as the SAT, given for college admission purposes. The test may contain a vocabulary section, but the concern is not whether an individual knows the particular words; knowledge of this sample of words is related to something else, namely doing well academically in college. Finally, consider a third test, the XYZ scale of depression. Although the scale contains no items about suicide ideation, it has been discovered empirically that high scorers on this scale are likely to attempt suicide. These three examples illustrate an important point: in psychological tests, the content of the test items may or may not cover the behavior that is of interest – there may be a lack of correspondence between test items and the predicted behavior. But a test can be quite useful if an empirical correspondence between test scores and real-life behavior can be shown.

CATEGORIES OF TESTS

Because there are thousands of tests, it would be helpful to be able to classify tests into categories, just as a bookstore might list its books under different headings. Because tests differ from each other in a variety of ways, there is no uniformly accepted system of classification. Therefore, we will invent our own, based on a series of questions that can be asked of any test. I should point out that despite a variety of advances in both theory and technique, standardized tests have changed relatively little over the years (Linn, 1986), so while new tests are continually published, a classificatory system should be fairly stable, i.e., applicable today as well as 20 years from now.

Commercially published? The first question is whether a test is commercially published (sometimes called a proprietary test) or not. Major tests like the Stanford-Binet and the Minnesota Multiphasic Personality Inventory are available for purchase by qualified users through commercial companies. The commercial publisher advertises primarily through its catalog, and for many tests makes available, for a fee, a specimen set, usually the test booklet and answer sheet, a scoring key to score the test, and a test manual that contains information about the test. If a test is not commercially published, then a copy is ordinarily available from the test author, and there may be some accompanying information, or perhaps just the journal article where the test was first introduced. Sometimes journal articles include the original test, particularly if it is quite short, but often they will not. (Examples of articles that contain test items are R. L. Baker, Mednick & Hocevar, 1991; L. R. Good & K. C. Good, 1974; McLain, 1993; Rehfisch, 1958a; Snell, 1989; Vodanovich & Kass, 1990.) Keep in mind that the contents of journal articles are copyrighted, and permission to use a test must be obtained from both the author and the publisher.

If you are interested in learning more about a specific test, first you must determine if the test is commercially published. If it is, then you will want to consult the Mental Measurements Yearbook (MMY), available in most university libraries. Despite its name, the MMY is published at irregular intervals rather than yearly. However, it is an invaluable guide. For many commercially published tests, the MMY will provide a brief description of the test (its purpose, applicable age range, type of score generated, price, administration time, and name and address of publisher), a bibliography of citations relevant to the test, and one or more reviews of the test by test experts. Tests that are reviewed in one edition of the MMY may or may not be reviewed in subsequent editions, so locating information about a specific test may involve browsing through a number of editions. MMY reviews of specific tests are also available through a computer service called the Bibliographic Retrieval Services.

If the test you are interested in learning about is not commercially published, it will probably have an author(s) who published an article about the test in a professional journal. The journal article will most likely give the author’s address at the time of publication. If you are a “legitimate” test user, for example a graduate student doing a doctoral dissertation or a psychologist engaged in research work, a letter to the author will usually result in a reply with a copy of the test and permission to use it. If the author has moved from the original address, you may locate the current address through various directories and “Who’s Who” types of books, or through computer-generated literature searches.

Administrative aspects. Tests can also be distinguished by various aspects of their administration. For example, there are group vs. individual tests; group tests can be administered to a group of subjects at the same time, and individual tests to one person only at one time. The Stanford-Binet test of intelligence is an individual test, whereas the SAT is a group test. Clinicians who deal with one client at a time generally prefer individual tests because these often yield observational data in addition to a test score; researchers often need to test large groups of subjects in minimum time and may prefer group tests (there are, of course, many exceptions to this statement). A group test can be administered to one individual; sometimes, an individual test can be modified so it can be administered to a group.

Tests can also be classified as speed vs. power tests. Speed tests have a time limit that affects performance; for example, you might be given a page of printed text and asked to cross out all the “e’s” in 25 seconds. How many you cross out will be a function of how fast you respond. A power test, on the other hand, is designed to measure how well you can do, and so either may have no time limit or a time limit of convenience (a 50-minute hour) that ordinarily does not affect performance. The time limits on speed tests are usually set so that only 50% of the applicants are able to attempt every item. Time limits on power tests are set so that about 90% of the applicants can attempt all items.
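These 50% and 90% conventions amount to reading percentiles off pilot data. Here is a minimal sketch in Python of that idea; the completion times (and everything else in the snippet) are invented for illustration, not drawn from any actual test:

```python
# A minimal sketch: choosing time limits from invented pilot data,
# where each value is the number of minutes an examinee needed to
# attempt every item on the test.
import statistics

completion_times = [12, 14, 15, 15, 16, 17, 18, 20, 22, 25]

cuts = statistics.quantiles(completion_times, n=100)
speed_limit = cuts[49]   # 50th percentile: only half could attempt every item
power_limit = cuts[89]   # 90th percentile: about 90% could attempt all items

print(f"speed-test limit: about {speed_limit:.1f} minutes")
print(f"power-test limit: about {power_limit:.1f} minutes")
```

In practice such limits would, of course, be based on a large standardization sample rather than on ten cases.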
Another administrative distinction is whether a test is a secure test or not. For example, the SAT is commercially published but is ordinarily not made available even to researchers. Many tests that are used in industry for personnel selection are secure tests whose utility could be compromised if they were made public. Sometimes only the scoring key is confidential, rather than the items themselves.

A final distinction from an administrative point of view is how invasive a test is. A questionnaire that asks about one’s sexual behaviors is ordinarily more invasive than a test of arithmetic; a test completed by the subject is usually more invasive than a report of an observer, who may report the observations without even the subject’s awareness.

The medium. Tests differ widely in the materials used, and so we can distinguish tests on this basis. Probably the majority of tests are paper-and-pencil tests that involve some set of printed questions and require a written response, such as marking a multiple-choice answer sheet. Other tests are performance tests that perhaps require the manipulation of wooden blocks or the placement of puzzle pieces in correct juxtaposition. Still other tests involve physiological measures such as the galvanic skin response, the basis of the polygraph (lie detector) machine. Increasing numbers of tests are now available for computer administration, and this may become a popular category.

Item structure. Another way to classify tests, which overlaps with the approaches already mentioned, is through their item structure. Test items can be placed on a continuum from objective to subjective. At the objective end, we have multiple-choice items; at the subjective end, we have the type of open-ended questions that clinical psychologists and psychiatrists ask, such as “tell me more,” “how do you feel about that?” and “tell me about yourself.” In between, we have countless variations such as matching items (closer to the objective pole) and essay questions (closer to the subjective pole). Objective items are easy to score and to manipulate statistically, but individually reveal little other than that the person answered correctly or incorrectly. Subjective items are difficult and sometimes impossible to quantify, but can be quite a revealing and rich source of information.

Another possible distinction in item structure is whether the items are verbal in nature or require performance. Vocabulary and math items are labeled verbal because they are composed of verbal elements; building a block tower is a performance item.

Area of assessment. Tests can also be classified according to the area of assessment. For example, there are intelligence tests, personality questionnaires, tests of achievement, career-interest tests, tests of reading, tests of neuropsychological functioning, and so on. The MMY uses 16 such categories. These are not necessarily mutually exclusive categories, and many of them can be further subdivided. For example, tests of personality could be further categorized into introversion-extraversion, leadership, masculinity-femininity, and so on.

In this textbook, we look at five major categories of tests:

1. Personality tests, which have played a major role in the development of psychological testing, both in its acceptance and criticism. Personality represents a major area of human functioning for social-behavioral scientists and lay persons alike;

2. Tests of cognitive abilities, not only traditional intelligence tests, but other dimensions of cognitive or intellectual functioning. In some ways, cognitive psychology represents a major new emphasis in psychology which has had a significant impact on all aspects of psychology, both as a science and as an applied field;

3. Tests of attitudes, values, and interests, three areas that psychometrically overlap and also offer lots of basic testing lessons;

4. Tests of psychopathology, primarily those used by clinicians and researchers to study the field of mental illness; and

5. Tests that assess normal and positive functioning, such as creativity, competence, and self-esteem.

Test function. Tests can also be categorized depending upon their function. Some tests are used to diagnose present conditions. (Does the client have a character disorder? Is the client depressed?) Other tests are used to make predictions. (Will this person do well in college? Is this client likely to attempt suicide?) Other tests are used in selection procedures, which basically involve accepting or not accepting a candidate, as in admission to graduate school. Some tests are used for placement purposes – candidates who have been accepted are placed in a particular “treatment.” For example, entering students at a university may be placed in different-level writing courses depending upon their performance in a writing exam. A battery of tests may be used to make such a placement decision or to assess which of several alternatives is most appropriate for the particular client – here the term typically used is classification (note that this term has both a broader meaning and a narrower meaning). Some tests are used for screening purposes; the term screening implies a rapid and rough procedure. Some tests are used for certification, usually related to some legal standard; thus passing a driving test certifies that the person has, at the very least, a minimum proficiency and is allowed to drive an automobile.

Score interpretation. Yet another classification can be developed on the basis of how scores on a test are interpreted. We can compare the score that an individual obtains with the scores of a group of individuals who also took the same test. This is called a norm-reference, because we refer to norms to give a particular score meaning; for most tests, scores are interpreted in this manner. We can also give meaning to a score by comparing that score to a decision rule called a criterion, so this would be a criterion-reference. For example, when you took a driving test (either written and/or road), the examiner did not say, “Congratulations, your score is two standard deviations above the mean.” You either passed or failed based upon some predetermined criterion that may or may not have been explicitly stated. Note that norm-reference and criterion-reference refer not to the test but to how the score or performance is interpreted. The same test could yield either or both score interpretations.
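The contrast can be made concrete with a small computation; the following Python sketch is purely illustrative, with invented numbers. The same raw score is given a norm-referenced reading (a z score against normative data) and a criterion-referenced reading (a check against a predetermined cutoff):

```python
# Invented illustration: one raw score, two interpretations.
raw_score = 72

# Norm-referenced: compare the score with the scores of others.
norm_mean, norm_sd = 50.0, 10.0          # hypothetical normative data
z = (raw_score - norm_mean) / norm_sd
print(f"norm-referenced: z = {z:+.1f} ({z:.1f} SDs above the mean)")

# Criterion-referenced: compare the score with a predetermined decision rule.
passing_cutoff = 70                       # hypothetical criterion
print("criterion-referenced:", "pass" if raw_score >= passing_cutoff else "fail")
```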
Another distinction that can be made is whether the measurement provided by the test is normative or ipsative, that is, whether the standard of comparison reflects the behavior of others or of the client. Consider a 100-item vocabulary test that we administer to Marisa, and she obtains a score of 82. To make sense of that score, we compare her score with some normative data – for example, the average score of similar-aged college students. Now consider a questionnaire that asks Marisa to decide which of two values is more important to her: “Is it more important for you to have (1) a good-paying job, or (2) freedom to do what you wish?” We could compare her choice with that of others, but in effect we have simply asked her to rank two items in terms of her own preferences or her own behavior; in most cases it would not be legitimate to compare her ranking with those of others. She may prefer choice number 2, but not by much, whereas for me choice number 2 is a very strong preference.

One way of defining ipsative is that the scores on the scale must sum to a constant. For example, if you are presented with a set of six ice cream flavors to rank order as to preference, no matter whether your first preference is “crunchy caramel” or “Bohemian tutti-frutti,” the sum of your six preferences will be 21 (1+2+3+4+5+6). On the other hand, if you were asked to rate each flavor independently on a 6-point scale, you could rate all of them high or all of them low; this would be a normative scale. Another way to define ipsative is to focus on the idea that in ipsative measurement the mean is that of the individual, whereas in normative measurement the mean is that of the group. Ipsative measurement is found in personality assessment; we look at a technique called the Q sort in Chapter 18. Block (1957) found that ipsative and normative ratings of personality were quite equivalent.
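The sum-to-a-constant property is easy to verify; the Python sketch below uses invented rankings and ratings. Ipsative ranks always total the same amount, so only the ordering within a person is informative, whereas independent (normative) ratings can legitimately be compared across people:

```python
# Invented illustration: ipsative ranks vs. normative ratings of six flavors.
marisa_ranks = [1, 2, 3, 4, 5, 6]     # her preference ordering
my_ranks     = [3, 1, 6, 2, 5, 4]     # a different ordering
assert sum(marisa_ranks) == sum(my_ranks) == 21   # always 1+2+...+6

marisa_ratings = [6, 5, 6, 6, 5, 6]   # independent 1-6 ratings: likes them all
my_ratings     = [2, 1, 1, 2, 1, 1]   # likes none of them
print(sum(marisa_ratings), sum(my_ratings))       # 34 vs. 8: totals now differ
```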
Another classificatory approach involves whether the responses made to the test are interpreted psychometrically or impressionistically. If the responses are scored and the scores interpreted on the basis of available norms and/or research data, then the process is a psychometric one. If instead the tester looks at the responses carefully on the basis of his/her expertise and creates a psychological portrait of the client, that process is called impressionistic. Sometimes the two are combined; for example, clinicians who use the Minnesota Multiphasic Personality Inventory (MMPI) score the test and plot the scores on a profile, and then use the profile to translate their impressions into diagnostic and characterological statements. Impressionistic testing is more prevalent in clinical diagnosis and the assessment of psychodynamic functioning than, say, in assessing academic achievement or mechanical aptitude.

Self-report versus observer. Many tests are self-report tests where the client answers questions about his/her own behavior, preferences, values, etc. However, some tests require judging someone else; for example, a manager might rate each of several subordinates on promptness, independence, good working habits, and so on.

Maximal vs. typical performance. Yet another distinction is whether a test assesses maximal performance (how well a person can do) or typical performance (how well the person typically does) (Cronbach, 1970). Tests of maximal performance usually include achievement and aptitude tests and are typically based on items that have a correct answer. Typical performance tests include personality inventories, attitude scales, and opinion questionnaires, for which there are no correct answers.

Age range. We can classify tests according to the age range for which they are most appropriate. The Stanford-Binet, for example, is appropriate for children but less so for adults; the SAT is appropriate for adolescents and young adults but not for children. Tests are used with a wide variety of clients, and we focus particularly on children (Chapter 9), the elderly (Chapter 10), minorities and individuals in different cultures (Chapter 11), and the handicapped (Chapter 12).

Type of setting. Finally, we can classify tests according to the setting in which they are primarily used. Tests are used in a wide variety of settings, but the most prevalent are school settings (Chapter 13), occupational and military settings (Chapter 14), and “mental health” settings such as clinics, courts of law, and prisons (Chapter 15).

The NOIR system. One classificatory schema that has found wide acceptance is to classify tests according to their measurement properties. All measuring instruments, whether a psychological test, an automobile speedometer, a yardstick, or a bathroom scale, can be classified into one of four types based on the numerical properties of the instrument:

1. Nominal scales. Here the numbers are used merely as labels, without any inherent numerical property. For example, the numbers on the uniforms of football players represent such a use, with the numbers useful to distinguish one player from another, but not indicative of any numerical property – number 26 is not necessarily twice as good as number 13, and number 92 is not necessarily better or worse than number 91. In psychological testing, we sometimes code such variables as religious preference by assigning numbers to preferences, such as 1 to Protestant, 2 to Catholic, 3 to Jewish, and so on. This does not imply that being a Protestant is twice as good as being a Catholic, or that a Protestant plus a Catholic equal a Jew. Clearly, nominal scales represent a rather low level of measurement, and we should not apply to these scales statistical procedures such as computing a mean.

2. Ordinal scales. These are the result of ranking. Thus if you are presented with a list of ten cities and asked to rank them as to favorite vacation site, you have an ordinal scale. Note that the results of an ordinal scale indicate rankings but not differences in such rankings. Mazatlan in Mexico may be your first choice, with Palm Springs a close second; but Toledo, your third choice, may be a “distant” third choice.
P1: JZP
0521861810c01 CB1038/Domino 0 521 86181 0 February 24, 2006 14:9

The Nature of Tests 9

cannot make ratios, and we cannot say that a temperature of 100 degrees is twice as hot as a temperature of 50 degrees.
Let's consider a more psychological example. We have a 100-item multiple-choice vocabulary test composed of items such as:

cat = (a) feline, (b) canine, (c) aquiline, (d) asinine

Each item is worth 1 point and we find that Susan obtains a score of 80 and Barbara a score of 40. Clearly, Susan's performance on the test is better than Barbara's, but is it twice as good? What if the vocabulary test had contained ten additional easy items that both Susan and Barbara had answered correctly; now Susan's score would have been 90 and Barbara's score 50, and clearly 90 is not twice 50. A zero score on this test does not mean that the person has zero vocabulary, but simply that they did not answer any of the items correctly – thus the zero is arbitrary and we cannot arrive at any conclusions that are based on ratios.
In this connection, I should point out that we might question whether our vocabulary test is in fact an interval scale. We score it as if it were, by assigning equal weights to each item, but are the items really equal? Most likely no, since some of the vocabulary items might be easier and some might be more difficult. I could, of course, empirically determine their difficulty level (we discuss this in Chapter 2) and score them appropriately (a really difficult item might receive 9 points, a medium-difficulty item 5, and so on), or I could use only items that are of approximately equal difficulty or, as is often done, I can assume (typically incorrectly) that I have an interval scale.

4. Ratio scales. Finally, we have ratio scales, which not only have equal intervals but also a true zero. The Kelvin scale of temperature, which chemists use, is a ratio scale, and on that scale a temperature of 200 is indeed twice as hot as a temperature of 100. There are probably no psychological tests that are true ratio scales; most only approximate interval scales – that is, they really are ordinal scales, but we treat them as if they were interval scales. However, newer theoretical models known as item-response theory (e.g., Lord, 1980; Lord & Novick, 1968; Rasch, 1966; D. J. Weiss & Davison, 1981) have resulted in ways of developing tests said to be ratio scales.
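Because these distinctions are easy to gloss over, a small computational illustration may help. The following Python snippet is only a sketch, using the hypothetical scores from the vocabulary example above rather than data from any actual test. It shows that ratios computed on an interval scale are not invariant: adding ten easy items that everyone answers correctly shifts every score by a constant and changes the apparent ratio, just as moving from Celsius (arbitrary zero) to Kelvin (true zero) changes the “twice as hot” conclusion.

# Interval scales: an arbitrary zero point makes ratios meaningless.
susan, barbara = 80, 40
print(susan / barbara)                    # 2.0 -- seemingly "twice as good"

# Add 10 easy items that both answer correctly: a constant shift.
print((susan + 10) / (barbara + 10))      # 1.8 -- the "ratio" has changed

# The same point with temperature: zero Celsius is arbitrary,
# zero Kelvin is a true zero.
hot, warm = 100, 50
print(hot / warm)                         # 2.0 on the Celsius scale
print((hot + 273.15) / (warm + 273.15))   # about 1.15 on the Kelvin scale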
ETHICAL STANDARDS

Tests are tools used by professionals to make what may possibly be some serious decisions about a client; thus both tests and the decision process involve a variety of ethical considerations to make sure that the decisions made are in the best interest of all concerned and that the process is carried out in a professional manner. There are serious concerns, on the part of both psychologists and lay people, about the nature of psychological testing and its potential misuse, as well as demands for increased use of tests.

APA ethics code. The American Psychological Association has since 1953 published and revised ethical standards, with the most recent publication of Ethical Principles of Psychologists and Code of Conduct in 1992. This code of ethics also governs, both implicitly and explicitly, a psychologist's use of psychological tests.
The Ethics Code contains six general principles:
1. Competence: Psychologists maintain high standards of competence, including knowing their own limits of expertise. Applied to testing, this might suggest that it is unethical for the psychologist to use a test with which he or she is not familiar to make decisions about clients.
2. Integrity: Psychologists seek to act with integrity in all aspects of their professional roles. As a test author, for example, a psychologist should not make unwarranted claims about a particular test.
3. Professional and scientific responsibility: Psychologists uphold professional standards of conduct. In psychological testing this might require knowing when test data can be useful and when they cannot. This means, in effect, that a practitioner using a test needs to be familiar with the research literature on that test.
4. Respect for people's rights and dignity: Psychologists respect the privacy and confidentiality of clients and have an awareness of cultural, religious, and other sources of individual differences. In psychological testing, this might include an awareness of when a test is appropriate for use with individuals who are from different cultures.
5. Concern for others' welfare: Psychologists are aware of situations where specific tests (for
example, ordered by the courts) may be detrimental to a particular client. How can these situations be resolved so that both the needs of society and the welfare of the individual are protected?
6. Social responsibility: Psychologists have professional and scientific responsibilities to community and society. With regard to psychological testing, this might cover counseling against the misuse of tests by the local school.

In addition to these six principles, there are specific ethical standards that cover eight categories, ranging from “General standards” to “Resolving ethical issues.” The second category is titled “Evaluation, assessment, or intervention” and is thus the area most explicitly related to testing; this category covers 10 specific standards:
1. Psychological procedures such as testing, evaluation, diagnosis, etc., should occur only within the context of a defined professional relationship.
2. Psychologists only use tests in appropriate ways.
3. Tests are to be developed using acceptable scientific procedures.
4. When tests are used, there should be familiarity with and awareness of the limitations imposed by psychometric issues, such as those discussed in this textbook.
5. Assessment results are to be interpreted in light of the limitations inherent in such procedures.
6. Unqualified persons should not use psychological assessment techniques.
7. Tests that are obsolete and outdated should not be used.
8. The purpose, norms, and other aspects of a test should be described accurately.
9. Appropriate explanations of test results should be given.
10. The integrity and security of tests should be maintained.

Standards for educational and psychological tests. In addition to the more general ethical standards discussed above, there are also specific standards for educational and psychological tests (American Educational Research Association, 1999), first published in 1954 and subsequently revised a number of times. These standards are quite comprehensive and cover (1) technical issues of validity, reliability, norms, etc.; (2) professional standards for test use, such as in clinical and educational settings; (3) standards for particular applications, such as testing linguistic minorities; and (4) standards that cover aspects of test administration, the rights of the test taker, and so on.
In considering the ethical issues involved in psychological testing, three areas seem to be of paramount importance: informed consent, confidentiality, and privacy.
Informed consent means that the subject has been given the relevant information about the testing situation and, based on that information, consents to being tested. Obviously this is a theoretical standard that in practice requires careful and thoughtful application. Clearly, to inform a subject that the test to be taken is a measure of “interpersonal leadership” may result in a set to respond in a way that can distort and perhaps invalidate the test results. Similarly, most subjects would not understand the kind of technical information needed to scientifically evaluate a particular test. So typically, informed consent means that the subject has been told in general terms what the purpose of the test is, how the results will be used, and who will have access to the test protocol.
The issue of confidentiality is perhaps even more complex. Test results are typically considered privileged communication and are shared only with appropriate parties. But what is appropriate? Should the client have access to the actual test results elucidated in a test report? If the client is a minor, should parents or legal guardians have access to the information? What about the school principal? What if the client was tested unwillingly, as when a court orders such testing for determination of psychological sanity, pathology that may pose a threat to others, or the risk of suicide? When clients seek psychological testing on their own, for example a college student requesting career counseling at the college counseling center, the guidelines are fairly clear. Only the client and the professional have access to the test results, and any transmission of test results to a third party requires written consent on the part of the client. But real-life issues often have a way of becoming more complex.
The right to privacy basically concerns the willingness of a person to share with others personal information, whether that information be factual or involve feelings and attitudes. In many tests, especially personality tests, the subject is asked to share what may be very personal information, occasionally without realizing that such sharing is taking place. At the same time, the subject cannot be instructed that, “if you answer true to item #17, I will take that as evidence that you are introverted.”
What is or is not invasion of privacy may be a function of a number of aspects. A person seeking the help of a sex therapist may well expect and understand the need for some very personal questions about his or her sex life, while a student seeking career counseling would not expect to be questioned about such behavior (for a detailed analysis of privacy as it relates to psychological testing see Ruebhausen & Brim, 1966; for some interesting views on privacy, including Congressional hearings, see the November 1965 and May 1966 issues of the American Psychologist).
Mention might also be made of feedback, providing and explaining test results to the client. Pope (1992) suggests that feedback may be the most neglected aspect of assessment, and describes feedback as a dynamic, interactive process, rather than a passive, information-giving process.
The concern for ethical behavior is a pervasive aspect of the psychological profession, but one that lay people often are not aware of. Students, for example, at times do not realize that their requests (“can I have a copy of the XYZ intelligence test to assess my little brother”) could involve unethical behavior.
In addition to the two major sets of ethical standards discussed above, there are other pertinent documents. For example, there are guidelines for providers of psychological services to members of populations whose ethnic, linguistic, or cultural backgrounds are diverse (APA, 1993), which include at least one explicit statement about the application of tests to such individuals, and there are guidelines for the disclosure of test data (APA, 1996). All of these documents are the result of hard and continuing work on the part of many professional organizations.

Test levels. If one considers tests as tools to be used by professionals trained in their use, then it becomes quite understandable why tests should not be readily available to unqualified users. In fact, the APA proposed many years ago a rating system of three categories of tests: level A tests require minimal training, level B tests require some advanced training, and level C tests require substantial professional expertise. These guidelines are followed by many test publishers, who often require that prospective customers fill out a registration form indicating their level of expertise to purchase specific tests.
There is an additional reason why the availability of tests needs to be controlled, and that is security. A test score should reflect the dimension being measured, for example, knowledge of elementary geography, rather than some other process such as knowledge of the right answers. As indicated earlier, some tests are highly secured and their use is tightly controlled; for example, tests like the SAT or the GRE are available only to those involved in their administration, and a strict accounting of each test booklet is required. Other tests are readily available, and their item content can sometimes be found in professional journals or other library documents.

INFORMATION ABOUT TESTS

It would be nice if there were one central source, one section of the library, that would give us all the information we needed about a particular test – but there isn't. You should realize that libraries do not ordinarily carry specimen copies of tests. Not only are there too many of them and they easily get out of date, but such a depository would raise some serious ethical questions. There may be offices on a college campus, such as the Counseling Center or the Clinical Psychology program, that have a collection of tests with scoring keys, manuals, etc., but these are not meant for public use. Information about specific tests is scattered quite widely, and often such a search is time consuming and requires patience as well as knowledge about available resources. The following steps can be of assistance:
1. The first step in obtaining information about a specific test is to consult the MMY. If the test is commercially published and has been reviewed
in the MMY, then our job will be infinitely easier; the MMY will give us the publisher's address and we can write for a catalog or information. It may also list references that we can consult, typically journal articles that are relevant. But what if the test is not listed in the MMY?
2. A second step is to check the original citation where mention of the particular test is made. For example, we may be reading a study by Jones which used the Smith Anxiety Scale; typically Jones will provide a reference for the Smith Anxiety Scale. We can locate that reference and then write to Smith for information about that scale. Smith's address will hopefully be listed in Smith's article, or we can look up Smith's address in directories such as the American Psychological Association Directory or a “Who's Who.”
3. A third step is to conduct a computer literature search. If the test is well known we might obtain quite a few citations. If the test is somewhat more obscure, we might miss the available information. Keep in mind that currently most computer literature searches only go back a limited number of years.
4. If steps 2 and 3 give us some citations, we might locate these citations in the Social Sciences Citation Index; for example, if we locate the citation to the Smith Anxiety Scale, the Social Sciences Citation Index will tell us which articles use the Smith citation in their list of references. Presumably these articles might be of interest to us.
5. Suppose instead of a specific test we are interested in locating a scale of anxiety that we might use in our own study, or we want to see some of the various ways in which anxiety is assessed. In such a case, we would again first check the MMY to see what is available and take some or all of the following steps.
6. Search the literature for articles/studies on anxiety to see what instruments have been used. We will quickly observe that there are several instruments that seem to be quite popularly used and many others that are not.
7. We might repeat steps 2 and 3 above.
8. If the test is a major one, whether commercially published or not, we can consult the library to see what books have been written about that particular test. There are many books available on such tests as the Rorschach, the Minnesota Multiphasic Personality Inventory, and the Stanford-Binet (e.g., J. R. Graham, 1990; Knapp, 1976; Megargee, 1972; Snider & Osgood, 1969).
9. Another source of information is Educational Testing Service (ETS), the publisher of most of the college and professional school entrance exams. ETS has an extensive test library of more than 18,000 tests and, for a fee, can provide information. Also, ETS has published annually since 1975 Tests in Microfiche, sets of indices and abstracts to various research instruments; some libraries subscribe to these.
10. A number of journals, such as the Journal of Counseling and Development and the Journal of Psychoeducational Assessment, routinely publish test reviews.
11. Finally, many books are collections of test reviews, test descriptions, etc., and provide useful information on a variety of tests. Some of these are listed in Table 1.1.

SUMMARY

A test can be defined as an objective and standardized measure of a sample of behavior. We can also consider a test as an experiment, an interview, or a tool. Tests can be used as part of psychological assessment, and are used for classification, self-understanding, program evaluation, and scientific inquiry. From the viewpoint of tests as an experiment, we need to pay attention to four categories of variables that can influence the outcome: the method of administration, situational variables, experimenter variables, and subject variables. Tests are used for decision making, although the content of a test need not coincide with the area of behavior that is assessed, other than to be empirically related.
Tests can be categorized according to whether they are commercially published or not, administrative aspects such as group versus individual tests, the type of item, the area of assessment, the function of the test, how scores are interpreted, whether the test is a self-report or not, the age range and type of client, and the measurement properties.
Ethical standards relate to testing and the issues of informed consent, confidentiality, and privacy. There are many sources of information about tests available through libraries, associations, and other avenues of research.
Table 1–1. Sources for test information

Andrulis, R. S. (1977). Adult assessment. Springfield, IL: Charles C Thomas.
Six major categories of tests are listed, including aptitude and achievement, personality, attitudes, and personal performance.

Beere, C. A. (1979). Women and women's issues: A handbook of tests and measures. San Francisco: Jossey-Bass.
This handbook covers such topics as sex roles, gender knowledge, and attitudes toward women's issues, and gives detailed information on a variety of scales.

Chun, K. T. et al. (1975). Measures for psychological assessment: A guide to 3000 original sources and their applications. Ann Arbor: University of Michigan.
An old but still useful source for measures of mental health.

Compton, C. (1980). A guide to 65 tests for special education. Belmont, CA: Fearon Education.
A review of tests relevant to special education.

Comrey, A. L., Backer, T. F., & Glaser, E. M. (1973). A sourcebook for mental health measures. Los Angeles: Human Interaction Research Institute.
A series of abstracts on about 1,100 lesser-known measures in areas ranging from alcoholism through mental health, all the way to vocational tests.

Corcoran, K., & Fischer, J. (1987). Measures for clinical practice: A sourcebook. New York: Free Press.
A review of a wide variety of measures to assess various clinical problems.

Fredman, N., & Sherman, R. (1987). Handbook of measurements for marriage and family therapy. New York: Bruner Mazel.
A review of 31 of the more widely used paper-and-pencil instruments in the area of marriage and family therapy.

Goldman, B. A., & Saunders, J. L. (1974). Directory of unpublished experimental mental measures, Vols. 1–4. New York: Behavioral Publications.
The first volume contains a listing of 339 unpublished tests that were cited in the 1970 issues of a group of journals. Limited information is given on each one.

Hogan, J., & Hogan, R. (Eds.) (1990). Business and industry testing. Austin, TX: Pro-ed.
A review of tests especially pertinent to the world of work, such as intelligence, personality, biodata, and integrity tests.

Johnson, O. G. (1970; 1976). Tests and measurements in child development. San Francisco: Jossey-Bass.
The two volumes cover unpublished tests for use with children.

Keyser, D. J., & Sweetland, R. C. (Eds.) (1984). Test critiques. Kansas City: Test Corporation of America.
This is a continuing series that reviews the most frequently used tests, with reviews written by test experts and quite detailed in their coverage. The publisher, Test Corporation of America, publishes a variety of books on testing.

Lake, D. G., Miles, M. B., & Earle, R. B., Jr. (1973). Measuring human behavior. New York: Teachers College Press.
A review of 84 different instruments and 20 compendia of instruments; outdated but still useful.

Mangen, D. J., & Peterson, W. A. (Eds.) (1982). Research instruments in social gerontology (2 volumes). Minneapolis: University of Minnesota Press.
If you are interested in measurement of the elderly this is an excellent source. For each topic, for example death and dying, there is a brief overall discussion, some brief commentary on the various instruments, a table of the cited instruments, a detailed description of each instrument, and a copy of each instrument.

McReynolds, P. (Ed.) (1968). Advances in psychological assessment. Palo Alto: Science and Behavior Books.
This is an excellent series of books, the first one published in 1968, each book consisting of a series of chapters on assessment topics, ranging from reviews of specific tests like the Rorschach and the California Psychological Inventory (CPI), to topic areas like the assessment of anxiety, panic disorder, and adolescent suicide.

Newmark, C. S. (Ed.) (1985; 1989). Major psychological assessment instruments, volumes I and II. Boston: Allyn & Bacon.
A nice review of the most widely used tests in current psychological assessment; the volumes give detailed information about the construction, administration, interpretation, and status of these tests.

Reeder, L. G., Ramacher, L., & Gorelnik, S. (1976). Handbook of scales and indices of health behavior. Pacific Palisades, CA: Goodyear Publishing.
A somewhat outdated but still useful source.

Reichelt, P. A. (1983). Location and utilization of available behavioral measurement instruments. Professional Psychology, 14, 341–356.
Includes an annotated bibliography of various compendia of tests.

Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (Eds.) (1990). Measures of personality and social psychological attitudes. San Diego, CA: Academic Press.
Robinson and his colleagues at the Institute for Social Research (University of Michigan) have published a number of volumes summarizing measures of political attitudes (1968), occupational attitudes and characteristics (1969), and social-psychological attitudes (1969, 1973, & 1991).

Schutte, N. S., & Malouff, J. M. (1995). Sourcebook of adult assessment strategies. New York: Plenum Press.
A collection of scales, their description and evaluation, to assess psychopathology, following the diagnostic categories of the Diagnostic and Statistical Manual of Mental Disorders.
Shaw, M. E., & Wright, J. M. (1967). Scales for the measurement of attitudes. New York: McGraw-Hill.
An old but still useful reference for attitude scales. Each scale is reviewed in some detail, with the actual scale items given.

Southworth, L. E., Burr, R. L., & Cox, A. E. (1981). Screening and evaluating the young child: A handbook of instruments to use from infancy to six years. Springfield, IL: Charles C Thomas.
A compendium of preschool screening instruments, but without any evaluation of these instruments.

Straus, M. A. (1969). Family measurement techniques. Minneapolis: University of Minnesota Press.
A review of instruments reported in the psychological and sociological literature from 1935 to 1965.

Sweetland, R. C., & Keyser, D. J. (Eds.) (1983). Tests: A comprehensive reference for assessments in psychology, education, and business. Kansas City: Test Corporation of America.
This is the first edition of what has become a continuing series. In this particular volume, over 3,000 tests, both commercially available and unpublished, are given brief thumbnail sketches.

Walker, D. K. (1973). Socioemotional measures for preschool and kindergarten children. San Francisco: Jossey-Bass.
A review of 143 measures covering such areas as personality, self-concept, attitudes, and social skills.

Woody, R. H. (Ed.) (1980). Encyclopedia of clinical assessment (2 vols.). San Francisco: Jossey-Bass.
This is an excellent, though now outdated, overview of clinical assessment; the 91 chapters cover a wide variety of tests ranging from measures of normality to moral reasoning, anxiety, and pain.
SUGGESTED READINGS

Dailey, C. A. (1953). The practical utility of the clinical report. Journal of Consulting Psychology, 17, 297–302.
An interesting study that tried to quantify how clinical procedures, based on tests, contribute to the decisions made about patients.

Fremer, J., Diamond, E. E., & Camara, W. J. (1989). Developing a code of fair testing practices in education. American Psychologist, 44, 1062–1067.
A brief historical introduction to a series of conferences that eventuated into a code of fair testing practices, and the code itself.

Lorge, I. (1951). The fundamental nature of measurement. In E. F. Lindquist (Ed.), Educational Measurement, pp. 533–559. Washington, DC: American Council on Education.
An excellent overview of measurement, including the NOIR system.

Willingham, W. W. (Ed.). (1967). Invasion of privacy in research and testing. Journal of Educational Measurement, 4, No. 1 supplement.
An interesting series of papers reflecting the long-standing ethical concerns involved in testing.

Wolfle, D. (1960). Diversity of talent. American Psychologist, 15, 535–545.
An old but still interesting article that illustrates the need for broader use of tests.

DISCUSSION QUESTIONS

1. What has been your experience with tests?
2. How would you design a study to assess whether a situational variable can alter test performance?
3. Why not admit everyone who wants to enter medical school, graduate programs in business, law school, etc.?
4. After you have looked at the MMY in the library, discuss ways in which it could be improved.
5. If you were to go to the University's Counseling Center to take a career interest test, how would you expect the results to be handled? (e.g., should your parents receive a copy?)
2 Test Construction, Administration, and Interpretation
AIM This chapter looks at three basic questions: (1) How are tests constructed?
(2) What are the basic principles involved in administering a test? and (3) How can
we make sense of a test score?
CONSTRUCTING A TEST

How does one go about constructing a test? Because there are all sorts of tests, there are also all sorts of ways to construct such tests, and there is no one approved or sure-fire method of doing this. In general, however, test construction involves a sequence of 8 steps, with lots of exceptions to this sequence.

1. Identify a need. The first step is the identification of a need that a test may be able to fulfill. A school system may require an intelligence test that can be administered to children of various ethnic backgrounds in a group setting; a literature search may indicate that what is available doesn't fit the particular situation. A doctoral student may need a scale to measure “depth of emotion” and may not find such a scale. A researcher may want to translate some of Freud's insights about “ego defense” mechanisms into a scale that measures their use. A psychologist may want to improve current measures of leadership by incorporating new theoretical insights, and therefore develops a new scale. Another psychologist likes a currently available scale of depression, but thinks it is too long and decides to develop a shorter version. A test company decides to come out with a new career interest test to compete with what is already available on the market. So the need may be a very practical one (we need a scale to evaluate patients' improvement in psychotherapy), or it may be very theoretical (a scale to assess “anomie” or “ego-strength”). Often, the need may be simply a desire to improve what is already available or to come up with one's own creation.

2. The role of theory. Every test that is developed is implicitly or explicitly influenced or guided by the theory or theories held by the test constructor. The theory may be very explicit and formal. Sigmund Freud, Carl Rogers, Emile Durkheim, Erik Erikson, and others have all developed detailed theories about human behavior or some aspect of it, and a practitioner of one of these theories would be heavily and knowingly influenced by that theory in constructing a test. For example, most probably only a Freudian would construct a scale to measure “id, ego, and superego functioning” and only a “Durkheimite” would develop a scale to measure “anomie.” These concepts are embedded in their respective theories and their meaning as measurement variables derives from the theoretical framework in which they are embedded.
A theory might also yield some very specific guidelines. For example, a theory of depression might suggest that depression is a disturbance in four areas of functioning: self-esteem, social support, disturbances in sleep, and negative affect. Such a schema would then dictate that the measure of depression assess each of these areas.
The theory may also be less explicit and not well formalized. The test constructor may, for example, view depression as a troublesome state composed of negative feelings toward oneself, a reduction in such activities as eating and talking with friends, and an increase in negative thoughts and suicide ideation. The point is that a test is not created in a vacuum, nor is it produced by a machine as a yardstick might be. The creation of a test is intrinsically related to the person doing the creating and, more specifically, to that person's theoretical views. Even a test that is said to be “empirically” developed, that is, developed on the basis of observation or real-life behavior (how do depressed people answer a questionnaire about depression), is still influenced by theory.
Not all psychologists agree. R. B. Cattell (1986), for example, argues that most tests lack a true theoretical basis, that their validity is due to work done after their construction rather than before, and that they lack good initial theoretical construction. Embretson (1985b) similarly argues that although current efforts have produced tests that do well at predicting behavior, the link between these tests and psychological theory is weak and often nonexistent.

3. Practical choices. Let's assume that I have identified as a need the development of a scale designed to assess the eight stages of life that Erik Erikson discusses (Erikson, 1963; 1982; see G. Domino & Affonso, 1990, for the actual scale). There are a number of practical choices that now need to be made. For example, what format will the items have? Will they be true-false, multiple choice, 7-point rating scales, etc.? Will there be a time limit or not? Will the responses be given on a separate answer sheet? Will the response sheet be machine scored? Will my instrument be a quick “screening” instrument or will it give comprehensive coverage for each life stage? Will I need to incorporate some mechanism to assess honesty of response? Will my instrument be designed for group administration?

4. Pool of items. The next step is to develop a table of specifications, much like the blueprint needed to construct a house. This table of specifications would indicate the subtopics to be covered by the proposed test (in our example, the eight life stages), perhaps their relative importance (are they all of equal importance?), and how many items each subtopic will contribute to the overall test (I might decide, for example, that each of the eight stages should be assessed by 15 items, thus yielding a total test of 120 items). This table of specifications may reflect not only my own thinking, but the theoretical notions present in the literature, other tests that are available on this topic, and the thinking of colleagues and experts. Test companies that develop educational tests such as achievement batteries often go to great lengths in developing such a table of specifications by consulting experts, either individually or in group conferences; the construction of these tests often represents a major effort of many individuals, at a high cost beyond the reach of any one person.
The table of specifications may be very formal or very informal, or sometimes absent, but it leads to the writing or assembling of potential items. These items may be the result of the test constructor's own creativity, they may be obtained from experts, from other measures already available, from a reading of the pertinent literature, from observations and interviews with clients, and many other sources. Writing good test items is both an art and a science and is not easily achieved. I suspect you have taken many instructor-made tests where the items were not clear, the correct answers were quite obvious, or the items focused on some insignificant aspects of your coursework. Usually, the classroom instructor writes items and uses most of them. The professional test constructor knows that the initial pool of items needs to be at a minimum four or five times as large as the number of items actually needed.
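Although a real table of specifications is usually a detailed document, its logic can be sketched in a few lines of Python. The stage labels below are Erikson's eight stages; the 15-items-per-stage allocation is simply the hypothetical decision described above, not a prescription.

# A minimal table of specifications for the hypothetical Erikson scale.
blueprint = {
    "trust vs. mistrust": 15,
    "autonomy vs. shame and doubt": 15,
    "initiative vs. guilt": 15,
    "industry vs. inferiority": 15,
    "identity vs. role confusion": 15,
    "intimacy vs. isolation": 15,
    "generativity vs. stagnation": 15,
    "integrity vs. despair": 15,
}

total_items = sum(blueprint.values())   # 120 items on the finished test
pool_size = total_items * 5             # initial pool: four to five times larger
print(total_items, pool_size)           # 120 600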
5. Tryouts and refinement. The initial pool of items will probably be large and rather unrefined. Items may be near duplications of each other, perhaps not clearly written or understood. The intent of this step is to refine the pool of items to a smaller but usable pool. To do this, we might ask colleagues (and/or enemies) to criticize the items, or we might administer them to a captive class of psychology majors to review and identify items that may not be clearly written. Sometimes, pilot testing is used where a preliminary form is administered to a sample of subjects to determine
whether there are any glitches, etc. Such pilot testing might involve asking the subjects to think aloud as they answer each item or to provide feedback as to whether the instructions are clear, the items interesting, and so on. We may also do some preliminary statistical work and assemble the test for a trial run called a pretest. For example, if I were developing a scale to measure depression, I might administer my pool of items (say 250) to groups of depressed and nondepressed people and then carry out item analyses to see which items in fact differentiate the two groups. For example, to the item “I am feeling blue” I might expect significantly more depressed people to answer “true” than nondepressed people. I might then retain the 100 items that seem to work best statistically, write each item on a 3 × 5 card, and sort these cards into categories according to their content, such as all the items dealing with sleep disturbances in one pile, all the items dealing with feelings in a separate pile, and so on. This sorting might indicate that we have too many items of one kind and not enough of another, so I might remove some of the excess items and write some new ones for the underrepresented category. Incidentally, this process is known as content analysis (see Gottschalk & Gleser, 1969). This step, then, consists of a series of procedures, some requiring logical analysis, others statistical analysis, that are often repeated several times, until the initial pool of items has been reduced to manageable size and all the evidence indicates that our test is working the way we wish it to.
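To make this item-analysis step concrete, here is a minimal sketch in Python. The miniature data set is invented purely for illustration; a real analysis would use hundreds of respondents and appropriate significance tests.

# Each row is one respondent's answers to four items (1 = "true", 0 = "false").
depressed    = [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 1]]
nondepressed = [[0, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 0]]

def endorsement(group, item):
    # Proportion of the group answering "true" to a given item.
    return sum(person[item] for person in group) / len(group)

for item in range(4):
    diff = endorsement(depressed, item) - endorsement(nondepressed, item)
    print(f"item {item}: difference in endorsement = {diff:+.2f}")

# Items 0 and 3 separate the two groups (+0.67 and +1.00) and would be
# retained; items 1 and 2 (0.00) do not differentiate and would be dropped.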
6. Reliability and validity. Once we have refined our pool of items to manageable size, and have done the preliminary work of the above steps, we need to establish that our measuring instrument is reliable, that is, consistent, and measures what we set out to measure, that is, that the test is valid. These two concepts are so basic and important that we devote an entire chapter to them (see Chapter 3). If we do not have reliability and validity, then our pool of items is not a measuring instrument.
7. Standardization and norms. Once we have established that our instrument is both reliable and valid, we need to standardize the instrument and develop norms. To standardize means that the administration, time limits, scoring procedures, and so on are all carefully spelled out so that no matter who administers the test, the procedure is the same. Obviously, if I administer an intelligence test and use a 30-minute time limit, and you administer the same test with a 2-hour time limit, the results will not be comparable. It might surprise you to know that there are some tests, both commercially published and not, that are not well standardized and may even lack instructions for administration.
Let's assume that you answer my vocabulary test, and you obtain a score of 86. What does that 86 mean? You might be tempted to conclude that 86 out of 100 is fairly good, until I tell you that second graders average 95 out of 100. You'll recall that 86 and 95 are called raw scores, which in psychology are often meaningless. We need to give meaning to raw scores by changing them into derived scores; but that may not be enough. We also need to be able to compare an individual's performance on a test with the performance of a group of individuals; that information is what we mean by norms. The information may be limited to the mean and standard deviation for a particular group or for many different groups, or it may be sufficiently detailed to allow the translation of a specific raw score into a derived score such as percentiles, T scores, z scores, IQ units, and so on.
bility and validity, then our pool of items is not a that we feel comfortable with its size, although
measuring instrument, and it is precisely this that “large enough” cannot be answered by a specific
distinguishes the instruments psychologists use number; simply because a sample is large, does
from those “questionnaires” that are published in not guarantee that it is representative. The sam-
popular magazines to determine whether a per- ple should be representative of the population to
son is a “good lover,” “financially responsible,” which we generalize, so that an achievement test
or a “born leader.” for use by fifth graders should have norms based
on fifth graders. It is not unusual for achievement tests used in school systems to have normative samples in the tens of thousands, chosen to be representative on the basis of census data or other guiding principles, but for most tests the sample size is often in the hundreds or smaller. The sample should also be clearly defined so that the test user can assess its adequacy – was the sample a captive group of introductory psychology students, or a “random” sample representative of many majors? Was the sample selected on specific characteristics such as income and age, to be representative of the national population? How were the subjects selected?

8. Further refinements. Once a test is made available, either commercially or to other researchers, it often undergoes refinements and revisions. Well-known tests such as the Stanford-Binet have undergone several revisions, sometimes quite major and sometimes minor. Sometimes the changes reflect additional scientific knowledge, and sometimes societal changes, as in our greater awareness of gender bias in language.
One type of revision that often occurs is the development of a short form of the original test. Typically, a different author takes the original test, administers it to a group of subjects, and shows by various statistical procedures that the test can be shortened without any substantial loss in reliability and validity. Psychologists and others are always on the lookout for brief instruments, and so short forms often become popular, although as a general rule, the shorter the test the less reliable and valid it is. (For some examples of short forms see Burger, 1975; Fischer & Fick, 1993; Kaufman, 1972; Silverstein, 1967.)
Still another type of revision that occurs fairly frequently comes about by factor analysis. Let's say I develop a questionnaire on depression that assesses what I consider are four aspects of depression. A factor analysis might indeed indicate that there are four basic dimensions to my test, and so perhaps each should be scored separately, in effect yielding four scales. Or perhaps the results of the factor analysis indicate that there is only one factor and that the four subscales I thought were separate are not; therefore, only one score should be generated. Or the factor analysis might indicate that of the 31 items on the test, 28 are working appropriately, but 3 should be thrown out since their contribution is minimal. (For some examples of factor analysis applied to tests, see Arthur & Woehr, 1993; Carraher, 1993; Casey, Kingery, Bowden & Corbett, 1993; Cornwell, Manfredo, & Dunlap, 1991; W. L. Johnson & A. M. Johnson, 1993.)
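The logic can be sketched compactly. One rough heuristic, assumed here purely for illustration rather than taken from any of the studies cited above, is to examine the eigenvalues of the inter-item correlation matrix: the number of eigenvalues greater than 1 suggests how many dimensions, and hence how many separate scores, the questionnaire may warrant.

import numpy as np

# "responses" is a subjects-by-items matrix of item scores; random
# placeholder data stand in here for real questionnaire responses.
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 12)).astype(float)

corr = np.corrcoef(responses, rowvar=False)   # 12 x 12 item correlations
eigenvalues = np.linalg.eigvalsh(corr)

n_factors = int(np.sum(eigenvalues > 1.0))    # eigenvalues-greater-than-1 rule
print(n_factors)

With real data, finding four large eigenvalues would support scoring four separate scales; finding only one would suggest a single total score.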
Finally, there are a number of tests that are multivariate, that is, the test is composed of many scales, as in the MMPI and the CPI. The pool of items that comprises the entire test is considered to be an “open system” and additional scales are developed based upon arising needs. For example, when the MMPI was first developed it contained nine different clinical scales; subsequently hundreds of scales have been developed by different authors. (For some examples, see Barron, 1953; Beaver, 1953; Giedt & Downing, 1961; J. C. Gowan & M. S. Gowan, 1955; Kleinmuntz, 1961; MacAndrew, 1965; Panton, 1958.)

TEST ITEMS

Writing test items. Because the total test is no better than its components, we need to take a closer look at test items. In general, items should be clear and unambiguous, so that responses do not reflect a misunderstanding of the item. Items should not be double-barreled. For example, “I enjoy swimming and tennis” is a poor item because you would not know whether the response of “true” really means that the person enjoys both of them, only one of them, or outdoor activities in general. Items should not use words such as “sometimes” or “frequently” because these words might mean different things to different people. An item such as, “Do you have headaches frequently?” is better written as, “Do you have a headache at least once a week?” (For more detailed advice on writing test items see Gronlund, 1993; Kline, 1986; Osterlind, 1989; Thorndike & Hagen, 1977; for a bibliography of citations on test construction, see O'Brien, 1988.)

Categories of items. There are two basic categories of items: (1) constructed-response items, where the subject is presented with a stimulus and produces a response – essay exams and sentence-completion tests are two examples; and (2) selected-response items, where the subject selects the correct or best response from a list of options – the
typical multiple-choice question is a good example.
There is a rather extensive body of literature on which approach is better under what circumstances, with different authors taking different sides of the argument (see Arrasmith, Sheehan, & Applebaum, 1984, for a representative study).

Types of items. There are many types of items (see Jensen, 1980; Wesman, 1971). Some of the more common ones:

1. Multiple-choice items. These are a common type, composed of a stem that has the question and the response options or choices, usually four or five, which are the possible answers. Multiple-choice items should assess the particular content area, rather than vocabulary or general intelligence. The incorrect options, called distractors, should be equally attractive to the test taker, and should differentiate between those who know the correct answer and those who don't. The correct response is called the keyed response. Sometimes, multiple-choice items are used in tests that assess psychological functioning such as depression or personality aspects, in which case there are no incorrect answers, but the keyed response is the one that reflects what the test assesses. When properly written, multiple-choice items are excellent. There are available guidelines to write good multiple-choice items. Haladyna and Downing (1989a; 1989b) surveyed some 46 textbooks and came up with 43 rules on how to write multiple-choice items; they found that some rules had been extensively researched but others had not. Properly constructed multiple-choice items can measure not only factual knowledge, but also theoretical understanding and problem-solving skills. At the same time, it is not easy to write good multiple-choice items with no extraneous cues that might point to the correct answer (such as the phrase “all of the above”) and with content that assesses complex thinking skills rather than just recognition of rote memory material.
Although most multiple-choice items are written with four or five options, a number of writers have presented evidence that three-option items may be better (Ebel, 1969; Haladyna & Downing, 1994; Lord, 1944; Sidick, Barrett, & Doverspike, 1994).
Multiple-choice items have a number of advantages. They can be answered quickly, so a particular test can include more items and therefore broader coverage. They can also be scored quickly and inexpensively, so that results are obtained rapidly and feedback provided without much delay. There is also available computerized statistical technology that allows the rapid computation of item difficulty and other useful indices.
At the same time, multiple-choice items have been severely criticized. One area of criticism is that multiple-choice items are much easier to create for isolated facts than for conceptual understanding, and thus they promote rote learning rather than problem-solving skills. Currently, there seems to be substantial pressure to focus on constructed-response tasks; however, such an approach has multiple problems and may in fact turn out to be even more problematic (Bennet & Ward, 1993).
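Both the mechanics of scoring keyed responses and the check on distractor attractiveness mentioned above are easy to show in a few lines of Python; all of the data here are invented for illustration.

# Scoring: one point whenever the keyed response is chosen.
key     = ["a", "c", "b", "d", "a"]
answers = ["a", "c", "c", "d", "a"]
print(sum(a == k for a, k in zip(answers, key)))   # 4 out of 5

# Distractor check for one item keyed "b": the incorrect options
# should draw roughly equal numbers of examinees.
from collections import Counter
tally = Counter(["b", "c", "b", "a", "b", "d", "b", "c", "b", "a"])
for option in "abcd":
    print(option, tally[option])   # a: 2, b: 5, c: 2, d: 1 -- "d" may be too implausible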
2. True-false items. Usually, these consist of a statement that the subject identifies as true or false, correct or incorrect, and so on. For example:

Los Angeles is the capital of California.
I enjoy social gatherings.

Note that in the first example, a factual statement, there is a correct answer. In the second example there is not, but the keyed response would be determined theoretically or empirically; if the item were part of a scale of introversion-extraversion, a “true” answer might be scored for extraversion.
From a psychometric point of view, factual true-false statements are not very useful. Guessing is a major factor because there is a 50% probability of answering correctly by guessing, and it may be difficult to write meaningful items that indeed are true or false under all circumstances. Los Angeles is not the capital of California, but there was a period when it was. Often the item writer needs to include words like usually, never, and always that can give away the correct answer. Personality- or opinion-type true-false items, on the other hand, are used quite frequently and found in many major instruments.
Most textbooks argue that true-false items, as used in achievement tests, are the least satisfactory item format. Other textbooks argue that
the limitations are more the fault of the item writer than of the item format itself. Frisbie and Becker (1991) reviewed the literature and formulated some 21 rules for writing true-false items.
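The 50% guessing figure deserves a moment's thought, because its consequences compound across items. A short computation (a sketch, assuming pure blind guessing and independent items) shows how often a student could get 12 or more items right on a 20-item true-false test by guessing alone:

from math import comb

n, p = 20, 0.5   # 20 true-false items, 50% chance of guessing each correctly
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(12, n + 1))
print(round(prob, 3))   # about 0.252

In other words, roughly one guesser in four would score 60% or better, one reason factual true-false items are psychometrically weak.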
3. Analogies. These are commonly found in tests of intelligence, although they can be used with almost any subject matter. Analogies can be quite easy or difficult and can use words, numbers, designs, and other formats. An example is:

46 is to 24 as 19 is to
(a) 9, (b) 13, (c) 38, (d) 106

(in this case, the answer is 9, because 4 × 6 = 24, and 1 × 9 = 9).
Analogies may or may not be in a multiple-choice format, although providing the choices is a better strategy psychometrically. Like any good multiple-choice item, an analogy item has only one correct answer.

4. Odd-man-out. These items are composed of words, numbers, etc., in which one component does not belong. For example:

donkey, camel, llama, ostrich

(Here ostrich does not belong because all the other animals have four legs, whereas ostriches have two.)
These items can also be quite varied in their difficulty level and are not limited to words. The danger here is that the dimension underlying the item (leggedness in the above example) may not be the only dimension, may not be necessarily meaningful, and may not be related to the variable being measured.

5. Sequences. This consists of a series of components, related to each other, with the last missing item to be generated by the subject or to be identified from a multiple-choice set. For example:

6, 13, 17, 24, 28,
(a) 32, (b) 35, (c) 39, (d) 46

(Here the answer is 35 because the series of numbers increases alternately by 7 points and 4 points: 6 + 7 = 13; 13 + 4 = 17; 17 + 7 = 24; etc.)

6. Matching items. These typically consist of two lists of items to be matched, usually of unequal length to counteract guessing. For example:

Cities            States
A. Toledo         1. California
B. Sacramento     2. Michigan
C. Phoenix        3. North Carolina
D. Ann Arbor      4. Ohio
E. Helena         5. Montana
                  6. Arizona
                  7. South Dakota
                  8. Idaho

Matching items can be useful in assessing specific factual knowledge such as names of authors and their novels, dates and historical events, and so on. One problem with matching items is that mismatching one component can result in mismatching other components; thus the components are not independent.

7. Completion items. These provide a stem and require the subject to supply an answer. If potential answers are given, this becomes a multiple-choice item. Examples of completion items are:

Wundt established his laboratory in the year ______.
I am always ______.

Note that the response possibilities in the first example are quite limited; the respondent gives either a correct or an incorrect answer. In the second example, different respondents can supply quite different responses. Sentence-completion items are used in some tests of personality and psychological functioning.

8. Fill in the blank. This can be considered a variant of the completion item, with the required response coming in a variety of positions. For example:

______ established the first psychological laboratory.
Wundt established a laboratory at the University of ______ in the year ______.

9. Forced-choice items. These consist of two or more options, equated as to attractiveness or other qualities, where the
subject must choose one. This type of item is used in some personality tests. For example:

Which item best characterizes you:
(a) I would rather go fishing by myself.
(b) I would rather go fishing with friends.

Presumably, choice (a) would reflect introversion, while choice (b) would reflect extraversion; whether the item works as intended would need to be determined empirically.

10. Vignettes. A vignette is a brief scenario, like the synopsis of a play or novel. The subject is asked to react in some way to the vignette, perhaps by providing a story completion, choosing from a set of alternatives, or making some type of judgment. Examples of studies that have used vignettes are those of G. Domino and Hannah (1987), who asked American and Chinese children to complete brief stories; of DeLuty (1988–1989), who had students assess the acceptability of suicide; of Wagner and Sternberg (1986), who used vignettes to assess what they called “tacit” knowledge; and of Iwao and Triandis (1993), who assessed Japanese and American stereotypes.

11. Rearrangement or continuity items. This is one type of item that is relatively rare but has potential. These items measure a person's knowledge about the order of a series of items. For example, we might list a set of names, such as Wilhelm Wundt, Lewis Terman, and Arthur Jensen, and ask the test taker to rank these in chronological order. The difficulty with this type of item is the scoring, but Cureton (1960) has provided a table that can be used in a relatively easy scoring procedure that reflects the difference between the person's answers and the scoring key.

Objective-subjective continuum. Different kinds of test items can be thought of as occupying a continuum along a dimension of objective-subjective:

objective ———————————— subjective

From a psychometric point of view, objective items, such as multiple-choice items, are the best. They are easily scored, contain only one correct answer, and can be handled statistically with relative ease. The shortcoming of such items is that they yield only the information of whether the subject answered correctly or incorrectly, or whether the subject chose “true” rather than “false” or “option A” rather than “option B.” They do not tell us whether the choice reflects lucky guessing, test “wiseness,” or actual knowledge.
Subjective items, such as essay questions, on the other hand, allow the respondent to respond in what can be a unique and revealing way. Guessing is somewhat more difficult, and the information produced is often more personal and revealing. From a clinical point of view, open-ended items such as “Tell me more about it,” “What brings you here?” or “How can I be of help?” are much more meaningful in assessing a client. Psychometrically, such responses are difficult to quantify and treat statistically.

Which item format to use? The choice of a particular item format is usually determined by the test constructor's preferences and biases, as well as by the test content. For example, in the area of personality assessment, many inventories have used a true-false format rather than a multiple-choice format. There is relatively little data that can serve as guidance to the prospective test author – only some general principles and some unresolved controversies.
One general principle is that statistical analyses require variation in the raw scores. The item “are you alive at this moment?” is not a good item because, presumably, most people would answer yes. We can build in variation by using item formats with several choices, such as multiple-choice items or items that require answering “strongly agree, agree, undecided, disagree, or strongly disagree,” rather than simply true-false; we can also increase variation by using more items – a 10-item test can yield scores that range from 0 to 10, while a 20-item test can yield scores that range from 0 to 20. If the items use the “strongly agree . . . strongly disagree” response format, we can score each item from 1 to 5, and the 10-item test now can yield raw scores from 10 to 50.
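A tiny sketch (with invented responses) shows this scoring in action:

SCALE = {"strongly disagree": 1, "disagree": 2, "undecided": 3,
         "agree": 4, "strongly agree": 5}

# One respondent's answers to a 10-item inventory (illustrative only).
answers = ["agree", "undecided", "strongly agree", "agree", "disagree",
           "agree", "strongly disagree", "undecided", "agree", "agree"]

raw_score = sum(SCALE[a] for a in answers)
print(raw_score)   # 34, somewhere in the possible range of 10 to 50

Each additional response option, like each additional item, widens the range of possible raw scores and thus the variation available for statistical analysis.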
One unresolved controversy is whether item response formats such as “strongly agree . . . strongly disagree” should have an “undecided” option or should force respondents to choose sides; also, should the responses be an odd
number so a person can select the middle “neutral” option, or should the responses be an even number, so the subject is forced to choose?
An example of the available data comes from a study by Bendig (1959), who administered a personality inventory to two samples, one receiving the standard form with a trichotomous response (true, ?, false), the other a form that omitted the ? response. The results were essentially equivalent, and Bendig (1959) concluded that using a dichotomous response was more economical in terms of scoring cost (now, it probably does not make any difference). For another example, see Tzeng, Ware, and Bharadwaj (1991).

Sequencing of items. Items in a test are usually listed according to some plan or rationale rather than just randomly. In tests of achievement or intelligence, a common strategy is to have easy items at the beginning and progressively difficult items toward the end. Another plan is to use a spiral omnibus format, which involves a series of items from easy to difficult, followed by another series of items from easy to difficult, and so on. In tests of personality where the test is composed of many scales, items from the same scale should not be grouped together, otherwise the intent of each scale becomes obvious and can alter the responses given. Similarly, some scales contain filler items that are not scored but are designed to “hide” the real intent of the scale. The general rule to be followed is that we want test performance to reflect whatever it is that the test is measuring, rather than some other aspect such as fatigue, boredom, speed of response, second-guessing, and so on; so where possible, items need to be placed in a sequence that will offset any such potential confounding variables.
In addition to practical questions, such as what
Direct assessment. Over the years, great dissat- type of item format to use, there are a number of
isfaction has been expressed about these various philosophical issues that guide test construction.
types of items, especially multiple-choice items. One such question is, “How do we know when an
Beginning about 1990, a number of investigators item is working the way it is supposed to?” Three
have begun to call for “authentic” measurement basic answers can be given: by fiat, by criterion
(Wiggins, 1990). Thus, more emphasis is being keying, and by internal consistency.
given to what might be called direct or perfor-
mance assessment, that is, assessment providing By fiat. Suppose you put together a set of items
for direct measurement of the product or per- to measure depression. How would you know
formance generated. Thus, if we wanted to test that they measure depression? One way, is to
the competence of a football player we would not simply state that they do, that because you are
administer a multiple-choice exam, but would an expert on depression, that because the items
reflect your best thinking about depression, and that because the content of all the items is clearly related to depression, therefore your set of items must be measuring depression. Most psychologists would not accept this as a final answer, but this method of fiat (a decree on the basis of authority) can be acceptable as a first step. The Beck Depression Inventory, which is probably one of the most commonly used measures of depression, was initially developed this way (A. T. Beck, 1967), although subsequent research has supported its utility. The same can be said of the Stanford-Binet test of intelligence.

Criterion-keyed tests. Many of the best known tests, such as the MMPI, CPI, and Strong Vocational Interest Blank, were constructed using this method. Basically, a pool of items is administered to a sample of subjects, for whom we also obtain some information on a relevant criterion – for example, scores on another test, GPA, ratings by supervisors, etc. For each test item we perform a statistical analysis (often using correlation) that shows whether the item is empirically related to the criterion. If it is, the item is retained for our final test. This procedure may be done several times with different samples, perhaps using different operational definitions for the criterion. The decision to retain or reject a test item is based solely on its statistical power, on its relationship to the criterion we have selected.

The major problem with this approach is the choice of criterion. Let’s assume I have developed a pool of items that presumably assess intelligence. I will administer this pool of items to a sample of subjects and also obtain some data for these subjects on some criterion of intelligence. What criterion will I use? Grade point average? Yearly income? Self-rated intelligence? Teacher ratings? Number of inventions? Listing in a “Who’s Who”? Each of these has some serious limitations, and I am sure you appreciate the fact that in the real world criteria are complex and far from perfect. Each of these criteria might also relate to a different set of items, so the items that are retained reflect the criterion chosen.

Some psychologists have difficulties with the criterion-keyed methodology in that the retained set of items may work quite well, but the theoretical reason may not be obvious. A scale may identify those who have leadership capabilities to different degrees, but it may not necessarily measure leadership in a theoretical sense, because the items were chosen for their statistical relationship rather than their theoretical cogency.

Criterion-keyed scales are typically heterogeneous or multivariate. That is, a single scale designed to measure a single variable is typically composed of items that, theoretically and/or in content, can be quite different from each other and thus, it can be argued, represent different variables. In fact, a content analysis or a factor analysis of the scale items might indicate that the items fall in separate clusters. This is because the criterion used is typically complex; GPA does not just reflect academic achievement, but also interest, motivation, grading policies of different teachers, and so on. Items may then be retained because they reflect one or more of these aspects.

A related criticism sometimes made about such scales is that the results are a function of the particular criterion used. If in a different situation a different criterion is used, then presumably the scale may not work. For example, if in selecting items for a depression scale the criterion is “psychiatric diagnosis,” then the scale may not work in a college setting where we may be more concerned about dropping out or suicide ideation. This, of course, is a matter of empirical validity and cannot be answered by speculation. In fact, the scales from tests such as the CPI have worked remarkably well in a wide variety of situations.
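The item-by-item logic of criterion keying can be sketched in a few lines of Python. This is a bare-bones illustration, not the actual procedure used for any published test; the cutoff of r = .30 is an arbitrary value chosen for the example:

    import numpy as np

    def criterion_key(item_scores, criterion, min_r=0.30):
        # item_scores: (subjects x items) array of numerically coded answers.
        # criterion: one value per subject, e.g., GPA or supervisor ratings.
        # Retain the items whose scores correlate with the external criterion.
        retained = []
        for j in range(item_scores.shape[1]):
            r = np.corrcoef(item_scores[:, j], criterion)[0, 1]
            if abs(r) >= min_r:
                retained.append(j)
        return retained

In practice the analysis would be repeated on fresh samples (cross-validation), because with a large pool some items will reach any cutoff by chance alone.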
A good example of empirical scale construction is the study by Rehfisch (1958), who set about to develop a scale for “personal rigidity.” He first reviewed the literature to define the rigidity-flexibility dimension and concluded that the dimension was composed of six aspects: (1) constriction and inhibition, (2) conservatism, (3) intolerance of disorder and ambiguity, (4) obsessional and perseverative tendencies, (5) social introversion, and (6) anxiety and guilt. At this point, he could have chosen to write a pool of items to reflect these six dimensions and publish his scale on the basis of its theoretical underpinnings and his status as an “expert” – this would have been the fiat method we discussed above. Or he could have chosen to administer the pool of items to a large group of subjects and through factor analysis determine whether the results indicated one main factor, presumably rigidity, or six factors, presumably the above dimensions. We discuss this method next.

Instead he chose to use data that had already been collected by researchers at the Institute of Personality Assessment and Research (IPAR) of the University of California at Berkeley. At this institute, a number of different samples, ranging from graduate students to Air Force captains, had been administered batteries of tests, including the CPI and the MMPI, and had been rated by IPAR staff on a number of dimensions, including “rigidity.” Rehfisch simply analyzed statistically the responses to the combined CPI-MMPI item pool (some 957 true-false statements) of the subjects rated highest and lowest 25% on rigidity. He cross-validated – that is, replicated the analysis – on additional samples. The result was a 39-item scale that correlated significantly with a variety of ratings and which was substantially congruent with the theoretical framework. High scorers on this scale tend to be seen as anxious, overcontrolled, inflexible in their social roles, orderly, and uncomfortable with uncertainty. Low scorers tend to be seen as fluent in their thinking and in their speech, outgoing in social situations, impulsive, and original. Interestingly enough, scores on the scale correlated only .19 with ratings of rigidity in a sample of medical school applicants. It is clear that the resulting scale is a “complex” rather than a “pure” measure of rigidity. In fact, a content analysis of the 39 items suggested that they can be sorted into eight categories ranging from “anxiety and constriction in social situations” to “conservatism and conventionality.” A subsequent study by Rehfisch (1959) presented some additional evidence for the validity of this scale.

Factor analysis as a way of test construction. This approach assumes that scales should be univariate and independent. That is, scales should measure only one variable and should not correlate with scales that measure a different variable. Thus, all the items retained for a scale should be homogeneous; they should all be interrelated.

As in the criterion-keying method, we begin with a pool of items that are administered to a sample of subjects. The sample may be one of convenience (e.g., college sophomores) or one of theoretical interest (patients with the diagnosis of anxiety) related to our pool of items. The responses are translated numerically (e.g., true = 1, false = 2), and the numbers are subjected to factor analysis. There are a number of techniques and a number of complex issues involved in factor analysis, but for our purposes we can think of factor analysis as a correlational analysis with items being correlated with a mythical dimension called a factor. Each item then has a factor loading, which is like a correlation coefficient between responses on that item and the theoretical dimension of the factor. Items that load significantly on a particular factor are assumed to measure the same variable and are retained for the final scale. Factor analysis does not tell us what the psychological meaning of the factor is, and it is up to the test constructor to study the individual items that load on the factor, and name the factor accordingly. A pool of items may yield several factors that appear to be statistically “robust” and psychologically meaningful, or our interest may lie only in the first, main factor and in the one scale.

As with criterion-keying, there have been a number of criticisms made of the factor-analytic approach to test construction. One is that factor analysis consists of a variety of procedures, each with a variety of assumptions and arbitrary decisions; there is argument in the literature about which of the assumptions and decisions are reasonable and which are not (e.g., Gorsuch, 1983; Guilford, 1967b; Harman, 1960; Heim, 1975).

Another criticism is that the results of a factor analysis reflect only what was included in the pool of items. To the extent that the pool of items is restricted in content, the results of the factor analysis will be restricted. Perhaps I should indicate here that this criticism is true of any pool of items, regardless of what is done to the items, but that usually those of the criterion-keying persuasion begin with pool items that are much more heterogeneous. In fact, they will often include items that on the surface have no relationship to the criterion, but the constructor has a “hunch” that the item might work.
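The retention rule just described – keep the items that load substantially on the intended factor – might be sketched as follows, assuming the scikit-learn library is available; the one-factor model and the .40 loading cutoff are illustrative choices for this example, not fixed conventions:

    from sklearn.decomposition import FactorAnalysis

    def first_factor_items(item_scores, min_loading=0.40):
        # item_scores: (subjects x items) array of numerically coded answers.
        # Standardizing each column first makes the loadings roughly
        # comparable to correlations (assumes no constant items).
        z = (item_scores - item_scores.mean(0)) / item_scores.std(0)
        fa = FactorAnalysis(n_components=1).fit(z)
        loadings = fa.components_[0]  # one loading per item
        return [j for j, load in enumerate(loadings)
                if abs(load) >= min_loading]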
Still another criticism is that the factor-analytic dimensions are theoretical dimensions, useful for understanding psychological phenomena, but less useful as predictive devices. Real-life behavior is typically complex; grades in college reflect not just mastery of specific topic areas, but general intelligence, motivation, aspiration level, the pressures of an outside job, personal relationships such as being “in love,” parental support, sleep habits, and so on. A factor-analytic scale of intelligence will only measure “pure intelligence” (whatever that may be) and thus not correlate highly with GPA, which is a complex and heterogeneous variable. (To see how a factor-analytic proponent answers these criticisms, see P. Kline, 1986.)

ADMINISTERING A TEST

If we consider a test as either an interview or an experiment, then how the test is administered becomes very important. If there is a manual available for the particular test, then the manual may (or may not) have explicit directions on how to administer the test, what specific instructions to read, how to answer subjects’ questions, what time limits if any to keep, and so on.

Rapport. One of the major aspects of test administration involves rapport, the “bond” that is created between examiner and examinee, so that the subject is cooperative, interested in the task, and motivated to pay attention and put forth a best effort. Sometimes such motivation is strongly affected by outside factors. A premedical student eager to be accepted into medical school will typically be quite cooperative and engaged in the task of taking a medical college admissions test; a juvenile delinquent being assessed at the request of a judge may not be so motivated.

In the American culture, tests and questionnaires are fairly common, and a typical high school or college student will find little difficulty in following test directions and doing what is being asked in the time limit allotted. Individuals such as young children, prisoners, emotionally disturbed persons, or individuals whose educational background has not given them substantial exposure to testing may react quite differently. Rapport, then, is very much like establishing a special bond with another person, such as occurs in friendships, in marriage, and in other human relationships. There are no easy steps to do so, and no pat answers. Certainly, if the examiner appears to be a warm and caring person, sensitive to the needs of the subject, rapport might be easier to establish. On the other hand, we expect a professional to be friendly but businesslike, so if the warmth becomes “gushiness,” rapport might decrease. Rapport is typically enhanced if the subject understands why she or he is being tested, what the tests will consist of, and how the resulting scores will be used. Thus, part of establishing rapport might involve allaying any fears or suspicions the subject may have. Rapport is also enhanced if the subject perceives that the test is an important tool to be used by a competent professional for the welfare of the client.

INTERPRETING TEST SCORES

A test usually yields a raw score, perhaps the number of items answered correctly. Raw scores in themselves are usually meaningless, and they need to be changed in some way to give them meaning. One way is to compare the raw score to a group average – that is what the word “norm” means: normal or average. Thus, if you obtained a raw score of 72 on a vocabulary test and found that the average raw score of a sample of college students is 48, you might be quite pleased with your performance. Knowing the average is, of course, quite limited information. When we have a raw score we need to locate that raw score in more precise terms than simply above or below average. Normative data then typically consist not just of one score or average, but the actual scores of a representative and sizable sample that allow you to take any raw score and translate it into a precise location in that normative group. To do this, raw scores need to be changed into derived scores.

Percentiles. Let’s suppose that our normative group contained 100 individuals and, by sheer luck, each person obtained a different score on our vocabulary test. These scores could be ranked, giving a 1 to the lowest score and a 100 to the highest score. If John now comes along and takes the vocabulary test, his raw score can be changed into the equivalent rank – his score of 76 might be equivalent to the 85th rank. In effect, that is what percentile scores are. When we have a distribution of raw scores, even if they are not all different, and regardless of how many scores we have, we can change raw scores into percentiles. Percentiles are a rank, but they represent the upper limit of the rank. For example,
a score at the 86th percentile is a score that is higher than 86 out of 100, and conversely lower than 14 out of 100; a score at the 57th percentile is a score that is higher than 57 out of 100, and lower than 43 out of 100. Note that the highest possible percentile is 99 (no score can be above all 100), and the lowest possible percentile is 1 (no one can obtain a score that has no rank). Percentiles are intrinsically meaningful in that it doesn’t matter what the original scale of measurement was; the percentile conveys a concrete position (see any introductory statistics text for the procedure to calculate percentiles). Percentiles have one serious limitation: they are an ordinal scale rather than an interval scale. Although ranks would seem to differ by only one “point,” in fact different ranks may differ by different points depending on the underlying raw score distribution. In addition, if you have a small sample, not all percentile ranks will be represented, so a raw score of 72 might equal the 80th percentile, and a raw score of 73, the 87th percentile.

Standard scores. We said that just knowing the average is not sufficient information to precisely locate a raw score. An average will allow us to determine whether the raw score is above or below the average, but we need to be more precise. If the average is 50 and the raw score is 60, we could obviously say that the raw score is “10 points above the mean.” That would be a useful procedure, except that each test has its own measurement scale – on one test the highest score might be 6 points above the mean, while on another test it might be 27 points above the mean, and how far away a score is from the mean is in part a function of how variable the scores are. For example, height measured in inches is typically less variable than body weight measured in ounces. To equalize for these sources of variation we need to use a scale of measurement that transcends the numbers used, and that is precisely what the standard deviation gives us. If we equate a standard deviation to one, regardless of the scale of measurement, we can express a raw score as being x number of standard deviations above or below the mean. To do so we change our raw scores into what are called standard or z scores, which represent a scale of measurement with mean equal to zero and SD equal to 1.

Consider a test where the mean is 62 and the SD is 10. John obtained a raw score of 60, Barbara, a raw score of 72, and Consuelo, a raw score of 78. We can change these raw scores into z scores through the following formula:

z = (X − M) / SD

where X is the raw score, M is the mean, and SD is the standard deviation.

For John, his raw score of 60 equals:

z = (60 − 62) / 10 = −0.2

For Barbara, her raw score of 72 equals:

z = (72 − 62) / 10 = +1.0

and for Consuelo, her raw score of 78 equals:

z = (78 − 62) / 10 = +1.60

We can plot these 3 z scores on a normal curve graph and obtain a nice visual representation of their relative positions (see Figure 2.1).

Note that changing raw scores into z scores does not alter the relative position of the three individuals. John is still the lowest scoring person, Consuelo the highest, and Barbara is in the middle. Why then change raw scores into z scores? Aside from the fact that z scores represent a scale of measurement that has immediate meaning (a z score of +3 is a very high score no matter what the test, whereas a raw score of 72 may or may not be a high score), z scores also allow us to compare across tests. For example, on the test above with mean of 62 and SD of 10, Consuelo obtained a raw score of 78. On a second test, with mean of 106 and SD of 9, she obtained a raw score of 117. On which test did she do better? By changing the raw scores to z scores the answer becomes clear. On test A, Consuelo’s raw score of 78 equals:

z = (78 − 62) / 10 = +1.60

On test B, Consuelo’s raw score of 117 equals:

z = (117 − 106) / 9 = +1.22

Plotting these on a normal curve graph, as in Figure 2.2, we see that Consuelo did better on test A.
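The conversions above are easy to verify in a few lines of Python (the numbers come from the examples in the text):

    def z_score(x, mean, sd):
        # Standard score: how many SDs a raw score lies from the mean.
        return (x - mean) / sd

    print(z_score(60, 62, 10))   # John: -0.2
    print(z_score(72, 62, 10))   # Barbara: +1.0
    print(z_score(78, 62, 10))   # Consuelo, test A: +1.6
    print(z_score(117, 106, 9))  # Consuelo, test B: +1.22 (rounded)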
[FIGURE 2–1. Relative positions of three z scores on the normal curve (M = 62, SD = 10): John’s raw score of 60 equals a z score of −.2; Barbara’s raw score of 72, a z score of +1; Consuelo’s raw score of 78, a z score of +1.60.]

[FIGURE 2–2. Equivalency of raw scores to z scores: a raw score of 78 on Test A (M = 62, SD = 10) equals a z score of +1.60; a raw score of 117 on Test B (M = 106, SD = 9) equals a z score of +1.22.]
T scores. The problem with z scores is that they can involve both positive and negative numbers as well as decimal numbers, and so are somewhat difficult to work with. This is a problem that can be easily resolved by changing the mean and SD of z scores to numbers that we might prefer. Suppose we wanted a scale of measurement with a mean of 50 and a SD of 10. All we need to do is multiply the z score we wish to change by the desired SD and add the desired mean. For example, to change a z score of +1.50 we would use this formula:

new score = z(desired SD) + desired mean = +1.50(10) + 50 = 65

This new scale, with a mean of 50 and SD of 10, is used so often in testing, especially for personality tests, that it is given a name: T scores. When you see T scores reported, you automatically know that the mean is 50 and the SD is 10, and that therefore a score of 70 is two standard deviations above the mean.

Educational Testing Service (ETS) uses a scale of measurement with mean of 500 and SD of 100 for its professional tests such as the SAT and the GRE. These are really T scores with an added zero. Note that an individual would not obtain a score of 586 – only 580, 590, and so on.

Stanines. Another type of transformation of raw scores is to a scale called stanine (a contraction of standard nine) that has been used widely in both the armed forces and educational testing. Stanines involve changing raw scores into a normally shaped distribution using nine scores that range from 1 (low) to 9 (high), with a mean of 5 and SD of 2. The scores are assigned on the basis of the following percentages:

stanine:      1   2    3    4    5    6    7    8   9
percentage:   4   7   12   17   20   17   12   7   4

Thus, in a distribution of raw scores, we would take the lowest 4% of the scores and call all of them ones, then the next 7% we would call twos, and so on (all identical raw scores would, however, be assigned the same stanine).

Stanines can also be classified into a fivefold classification as follows:

stanine:      1      2 & 3           4, 5, 6    7 & 8           9
defined as:   poor   below average   average    above average   superior
percentage:   4      19              54         19              4

or a tripartite classification:

stanine:      1, 2, 3   4, 5, 6   7, 8, 9
defined as:   low       average   high
percentage:   23        54        23

Sometimes stanines actually have 11 steps, where the stanine of 1 is divided into 0 and 1 (with 1% and 3% of the cases), and the stanine of 9 is divided into 9 and 10 (with 3% and 1% of the cases). Other variations of stanines have been prepared, but none have become popular (Canfield, 1951; Guilford & Fruchter, 1978). Note that unlike z scores and T scores, stanines force the raw score distribution into a normal distribution, whereas changing raw scores into z scores or T scores using the above procedures does not change the shape of the distribution. Don’t lose sight of the fact that all of these different scales of measurement are really equivalent to each other. Figure 2.3 gives a graphic representation of these scales.
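Both transformations can be sketched in Python; the stanine boundaries below are simply the cumulative percentages given above (4, 11, 23, 40, 60, 77, 89, 96):

    def t_score(z):
        # T score: a z score re-expressed on a scale with mean 50, SD 10.
        return z * 10 + 50

    def stanine(percentile):
        # Assign a stanine (1-9) from a percentile rank (0-100).
        cutoffs = [4, 11, 23, 40, 60, 77, 89, 96]  # upper limits, stanines 1-8
        for s, cut in enumerate(cutoffs, start=1):
            if percentile <= cut:
                return s
        return 9

    print(t_score(1.5))  # 65, as in the formula above
    print(stanine(50))   # 5, the middle stanine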
[FIGURE 2–3. Relationships of different types of scores, based on the normal distribution. Percent of cases under portions of the normal curve: 2.14, 13.59, 34.13, 34.13, 13.59, 2.14. At −3, −2, −1, 0, +1, +2, and +3 standard deviations, the equivalent percentiles are 1, 3, 16, 50, 84, 97, 99; z scores run from −3 to +3; T scores from 20 to 80; deviation IQs from 55 to 145; stanines from 1 to 9.]

ITEM CHARACTERISTICS

We now need to take a closer look at two aspects of test items: item difficulty and item discrimination.

Item Difficulty

The difficulty of an item is simply the percentage of persons who answer the item correctly. Note that the higher the percentage, the easier the item; an item that is answered correctly by 60% of the respondents has a p (for percentage) value of .60. A difficult item that is answered correctly by only 10% has a p = .10, and an easy item answered correctly by 90% has a p = .90. Not all test items have correct answers. For example, tests of attitudes, of personality, of political opinions, etc., may present the subject with items that require agreement-disagreement, but for which there is no correct answer. Most items, however, have a keyed response, a response that if endorsed is given points. On a scale of anxiety, a “yes” response to the item, “are you nervous most of the time?” might be counted as reflecting anxiety and would be the keyed response.

If the test were measuring “calmness,” then a “no” response to that item might be the keyed response. Thus item difficulty can simply represent the percentage who endorsed the keyed response.

What level of difficulty? One reason we may wish to know the difficulty level of items is so we can create tests of different difficulty levels, by judicious selection of items. In general, from a psychometric point of view, tests should be of average difficulty, average being defined as p = .50. Note that this results in a mean score near 50%, which may seem quite a demanding standard. The reason for this is that a p = .50 yields the most discriminating items, items that reflect individual differences. Consider items that are either very difficult (p = .00) or very easy (p = 1.00). Psychometrically, such items are not useful because they do not reflect any differences between individuals. To the degree that different individuals give different answers, and the answers are related to some behavior, to that degree are the items useful; thus generally the most useful items are those with p near .50.

The issue is, however, somewhat more complicated. Assume we have a test of arithmetic, with all items of p = .50. Children taking the test would presumably not answer randomly, so if Johnny gets item 1 correct, he is likely to get item 2 correct, and so on. If Mark misses item 1, he is likely to miss item 2, and so on. This means, at least theoretically, that one half of the children would get all the items correct and one half would get all of them incorrect, so that there would be only two raw scores, either zero or 100 – a very unsatisfactory state of affairs. One way to get around this is to choose items whose average difficulty is .50 but whose difficulties may in fact range widely, perhaps from .30 to .70, or similar values.
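Computing p values from a matrix of scored answers is a one-line operation per item; a minimal sketch in Python (the answer matrix is invented):

    import numpy as np

    # Rows are test takers, columns are items; 1 = keyed response, 0 = other.
    answers = np.array([[1, 1, 0],
                        [1, 0, 0],
                        [1, 1, 1],
                        [1, 0, 0],
                        [0, 1, 0]])

    p = answers.mean(axis=0)  # proportion endorsing each item's keyed response
    print(p)                  # [0.8 0.6 0.2]: easy, moderate, difficult items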
[FIGURE 2–4. Example of an easy test item passed by 92% of the sample: 92% of the area of the normal curve lies to the right of a z score of −1.41.]

Another complicating factor concerns the target “audience” for which the test will be used. Let’s say I develop a test to identify the brightest 10% of entering college freshmen for possible placement in an honors program. In that case, test items should have an average p = .10; that is, the test should be quite difficult, with the average p value reflecting the percentage of scores to be selected – in this example, 10%. Tests such as the SAT or GRE are quite demanding because their difficulty level is quite high.

Measurement of item difficulty. Item difficulty then represents a scale of measurement identical with percentage, where the average is 50% and the range goes from zero to 100%. This is, of course, an ordinal scale and is of limited value because statistically not much can be done with ordinal measurement. There is a way, however, to change this scale to an interval scale: by changing the percentages to z scores. All we need to do is have a table of normal curve frequencies (see appendix) and we can read the z scores directly from the corresponding percentage. Consider, for example, a very easy item with p = .92, represented by Figure 2.4. Note that by convention, higher scores are placed on the right, and we assume that the 92% who got this item correct were higher scoring individuals (at least on this item). We need then to translate the percentage of the area of the curve that lies to the right (92%) into the appropriate z score, which our table tells us is equal to −1.41.

A very difficult item of p = .21 would yield a z score of +0.81, as indicated in Figure 2.5. Note that items that are easy have negative z scores, and items that are difficult have positive z scores. Again, we can change z scores to a more manageable scale of measurement that eliminates negative values and decimals. For example, ETS uses a delta scale with a mean of 13 and a SD of 4. Thus delta score = z(4) + 13. An item with p = .58 would yield a z score of −.20, which would equal a delta score of:

(−.20)(4) + 13 = 12.2 (rounding off = 12)
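The percentage-to-z-to-delta chain can be sketched with the inverse normal function in the scipy library; note that, as in the text, the z score is the point with proportion p of the curve to its right:

    from scipy.stats import norm

    def item_z(p):
        # z score such that proportion p of the normal curve lies above it.
        return -norm.ppf(p)

    def delta(p):
        # ETS delta scale: mean 13, SD 4.
        return item_z(p) * 4 + 13

    print(round(item_z(0.92), 2))  # -1.41, the easy item of Figure 2.4
    print(round(item_z(0.21), 2))  # 0.81, the difficult item of Figure 2.5
    print(round(delta(0.58)))      # 12, as in the example above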
[FIGURE 2–5. Example of a difficult test item passed by 21% of the sample: 21% of the area of the normal curve lies to the right of a z score of +0.81.]

The bandwidth-fidelity dilemma. In developing a test, the test constructor chooses a set of items from a larger pool, with the choice based on rational and/or statistical reasons. Classical test theory suggests that the best test items are those with a .50 difficulty level – for example, a multiple-choice item where half select the correct answer, and half the distractors. If we select all or most of the items at that one level of difficulty, we will have a very good instrument for measuring those individuals who indeed fall at that level on the trait being measured. However, for individuals who are apart from the difficulty level, the test will not be very good. For example, a person who is low on the trait will receive a low score based on the few correctly answered items; a person who is high will score high, but the test will be “easy” and again won’t provide much information. In this approach, using a “peaked” conventional test (peaked because the items peak at a particular difficulty level), we will be able to measure some of the people very well and some very poorly.

We can try to get around this by using a rectangular distribution of items, that is, selecting a few items at a .10 level of difficulty, a few at .20, a few at .30, and so on to cover the whole range of difficulty, even though the average difficulty will still be .50. There will be items here that are appropriate for any individual no matter

where they are on the trait, but because a test cannot be too long, the appropriate items for any one person will be few. This means that the test will be able to differentiate between individuals at various levels of a trait, but the precision of these differentiations will not be very great. A peaked conventional test can provide high fidelity (i.e., precision) where it is peaked, but little bandwidth (i.e., it does not differentiate very well individuals at other positions on the scale). Conversely, a rectangular conventional test has good bandwidth but low overall fidelity (Weiss, 1985).

Guessing. Still another complicating factor in item difficulty is that of guessing. Although individuals taking a test do not usually answer randomly, typically there is a fair amount of guessing going on, especially with multiple-choice items where there is a correct answer. This inflates the p value, because a p value of .60 really means that among the 60% who answered the item correctly, a certain percentage answered it correctly by lucky guessing, although some will have answered it incorrectly by bad guessing (see Lord, 1952).

A number of item forms, such as multiple-choice items, can be affected by guessing. On a multiple-choice examination, with each item composed of five choices, anyone guessing blindly would, by chance alone, answer about one fifth of the items correctly. If all subjects guessed to the same degree, guessing would not be much of a problem. But subjects don’t do that, so guessing can be problematic. A number of formulas or corrections of the total score have been developed to take guessing into account, such as:

score = right − wrong / (k − 1)

where k = the number of alternatives per item.

The rationale here is that the probability of a correct guess is 1/k and the probability of an incorrect guess is (k − 1)/k. So we expect, on the average, a person to be correct once for every k − 1 times that they are incorrect. The problem is that correction formulas such as the above assume that item choices are equally plausible, and that items are of two types – those that the subject knows and answers correctly and those that the subject doesn’t know and guesses blindly.

Note that the more choices there are for each item, the less significant guessing becomes. In true-false items, guessing can result in 50% correct responses. In five-choice multiple-choice items, guessing can result in 20% correct answers, but if each item had 20 choices (an awkward state of affairs), guessing would only result in 5% correct responses.

A simpler, but not perfect, solution is to include instructions on a test telling all candidates to do the same thing – that is, guess when unsure, leave doubtful items blank, etc. (Diamond & Evans, 1973).
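The correction formula is easily written out; a minimal sketch, with invented scores:

    def corrected_score(right, wrong, k):
        # Correction for guessing: right - wrong / (k - 1), where k is the
        # number of alternatives per item. Omitted items count as neither.
        return right - wrong / (k - 1)

    # 40 right and 10 wrong on a five-choice test:
    print(corrected_score(40, 10, 5))  # 37.5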
Item Discrimination

If we have a test of arithmetic, each item on that test should ideally differentiate between those who know the subject matter and those who don’t. If we have a test of depression, each item should ideally differentiate between those who are depressed and those who are not. Item discrimination refers to the ability of an item to correctly “discriminate” between those who are higher on the variable in question and those who are lower. Note that for most variables we don’t ordinarily assume a dichotomy but rather a continuous variable – that is, we don’t believe that the world is populated by two types of people, depressed and nondepressed, but rather that different people can show different degrees of depression.

There are a number of ways of computing item-discrimination indices, but most are quite similar (Oosterhof, 1976) and basically involve comparing the performance of high scorers with that of low scorers, for each item. Suppose, for example, we have an arithmetic test that we have administered to 100 children. For each child, we have a total raw score on the test, and a record of their performance on each item. To compute item-discrimination indices for each item, we first need to decide how we will define “high scorer” vs. “low scorer.”

Obviously, we could take all 100 children, compute the median of their total test scores, and label those who scored above the median as high scorers, and those below the median as low scorers. The advantage of this procedure is that we use all the data we have, all 100 protocols. The disadvantage is that at the center of the distribution there is a fair amount of “noise.” Consider Sarah, who scored slightly above the median and is thus identified as a high scorer. If she were to retake the test, she might well score below the median and now be identified as a low scorer.

At the other extreme, we could take the five children who scored highest and label them high scorers, and the five children who scored lowest and label them low scorers. The advantage here is that these extreme scores are not likely to change substantially on a retest; they most likely are not the result of guessing and probably represent “real-life” correspondence. The disadvantage is that now we have rather small samples, and we can’t be sure that any calculations we perform are really stable. Is there a happy medium that on the one hand keeps the “noise” to a minimum and on the other maximizes the size of the sample? Years ago, Kelley (1939) showed that the best strategy is to select the upper 27% and the lower 27%, although slight deviations from this, such as 25% or 30%, don’t matter much. (Note that in the example of the rigidity scale developed by Rehfisch, he analyzed the top and bottom 25% of those rated on rigidity.)

For our sample of 100 children we would then select the top 27 scorers and call them “high scorers” and the bottom 27 and call these “low scorers.” We would look at their answers for each test item and compute the difficulty level of each item, separately for each group, using percentages. The difference between difficulty levels for a particular item is the index of discrimination (abbreviated as D) for that item. Table 2.1 gives an example of such calculations.

Table 2–1

Test item    Upper 27     Lower 27     Index of discrimination
1            23 (85%)     6 (22%)      63%
2            24 (89%)     22 (81%)     8%
3            6 (22%)      4 (15%)      7%
4            9 (33%)      19 (70%)     −37%

Note that the index of discrimination is expressed as a percentage and is computed from two percentages. We could do the same calculations on the raw scores, in this case the number of correct responses out of 27, but the results might differ from test to test if the size of the sample changes.

The information obtained from such an analysis can be used to make changes in the items and improve the test. Note, for example, that item 1 seems to discriminate quite well. Most of the high scorers (85%) answered the item correctly, while far fewer of the low scorers (22%) answered the item correctly. Theoretically, a perfectly discriminating item would have a D value of 100%. Items 2 and 3 don’t discriminate very well: item 2 is too easy and item 3 is too difficult. Item 4 works, but in reverse! Fewer of the higher scorers got the item correct. If this is an item where there is a correct answer, a negative D would alert us that there is something wrong with the item, that it needs to be rewritten. If this were an item from a personality test where there is no correct answer, the negative D would in fact tell us that we need to reverse the scoring.
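The upper-lower index D can be sketched as follows; the 27% figure follows Kelley (1939), as discussed above:

    import numpy as np

    def discrimination_index(item_correct, total_scores, fraction=0.27):
        # D = proportion passing in the top group minus proportion passing
        # in the bottom group, where groups are defined by total test score.
        # item_correct and total_scores are parallel numpy arrays.
        n = len(total_scores)
        k = max(1, int(round(n * fraction)))
        order = np.argsort(total_scores)
        low, high = order[:k], order[-k:]
        return item_correct[high].mean() - item_correct[low].mean()

For item 1 of Table 2–1 this difference is .85 − .22 = .63; a negative value, as for item 4, flags an item that works in reverse.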
We have chosen to define high scorer and low scorer on the basis of the total test score itself. This may seem a bit circular, but it is in fact quite legitimate. If the test measures arithmetic knowledge, then a high scorer on arithmetic knowledge is indeed someone who scores high on the test. There is a second way, however, to define high and low scorers, or more technically to identify extreme groups, and that is to use a criterion that is not part of the test we are calibrating. For example, we could use teacher evaluations of the 100 children as to which ones are good in math and which ones are not. For a test of depression, we could use psychiatric diagnosis. For a personality scale of leadership, we could use peer evaluation, self-ratings, or data obtained from observations.

Does it matter whether we compute item-discrimination indices based on total test scores or based on an external criterion? If we realize that such computations are not simply an exercise to fill time, but are done so we can retain those items with the highest D values, those items that work best, then which procedure we use becomes very important, because different procedures result in the retention of different items. If we use the total test score as our criterion, an approach called internal consistency, then we will be retaining items that tend to be homogeneous, that is, items that tend to correlate highly with each other. If we use an external criterion, that criterion will most likely be more complex psychologically than the total test score. For example, teachers’ evaluations of being “good at math” may reflect not only math knowledge, but how likeable the child is, how physically attractive, outgoing, all-around intelligent, and so on. If we now retain those items that discriminate against such a complex criterion, we will most likely retain heterogeneous items, items that cover a wide variety of the components of our criterion. If we are committed to measuring arithmetic knowledge in as pure a fashion as possible, then we will use the total test score as our criterion. If we are interested in developing a test that will predict to the maximum degree some real-world behavior, such as teachers’ recognition of a child’s ability, then we will use the external criterion. Both are desirable practices and sometimes they are combined, but we should recognize that the two practices represent different philosophies of testing. Allen and Yen (1979) argue that both practices cannot be used simultaneously, that a test constructor must choose one or the other. Anastasi (1988), on the other hand, argues that both are important.

Philosophies of testing. And so once again we are faced with the notion that we have alternatives, and although the proponents of each alternative argue that theirs is the way, the choice comes down to personal preference and to a compatible philosophy of testing. With regard to test construction, there seem to be two basic camps. One approach, that of factor analysis, believes that tests should be pure measures of the dimension being assessed. To develop such a pure measure, items are selected that statistically correlate as high as possible with each other and/or with the total test score. The result is a scale that is homogeneous, composed of items all of which presumably assess the same variable. To obtain such homogeneity, factor analysis is often used, so that the test items that are retained all center on the same dimension or factor. Tests developed this way must not correlate with other dimensions. For example, scores on a test of anxiety must not correlate with scores on a test of depression, if the two dimensions are to be measured separately. Tests developed this way are often useful for understanding a particular psychological phenomenon, but scores on the test may in fact not be highly related to behavior in the real world.

A second philosophy, that of empiricism, assumes that scales are developed because their primary function is to predict real-life behavior, and items are retained or eliminated depending on how well they correlate with such real-life behavior. The result is a test that is typically composed of heterogeneous items, all of which share a correlation with a nontest criterion but which may not be highly correlated with each other. Such scales often correlate significantly with other scales that measure different variables, but the argument here is that “that’s the way the world is.” As a group, people who are intellectually bright also tend to be competent, sociable, etc., so scales of competence may most likely correlate with measures of sociability, and so on. Such scales are often good predictors of real-life behaviors, but may sometimes leave us wondering why the items work as they do. For an interesting example of how these two philosophies can lead their proponents to entirely different views, see the reviews of the CPI in the seventh MMY (Goldberg, 1972; Walsh, 1972), and in the ninth MMY (Baucom, 1985; Eysenck, 1985).
Item response theory (IRT). The “classical” theory of testing goes back to the early 1900s, when Charles Spearman developed a theoretical framework based on the simple notion that a test score was the sum of a “true” score plus random “error.” Thus a person may obtain different IQs on two intelligence tests because of differing amounts of random error; the true score presumably does not vary. Reliability is in fact a way of assessing how accurately obtained scores covary with true scores.

A rather different approach, known as item response theory (IRT), began in the 1950s, primarily through the work of Frederic Lord and George Rasch. IRT also has a basic assumption: that performance on a test is a function of an unobservable proficiency variable. IRT has become an important topic, especially in educational measurement. Although it is a difficult topic that involves some rather sophisticated statistical techniques beyond the scope of this book (see Hambleton & Swaminathan, 1985; Lord, 1980), the basic idea is understandable. The characteristics of a test item, such as item difficulty, are a function of the particular sample to whom the item was administered. A vocabulary item may, for example, be quite difficult for second graders but quite easy for college students. Thus in classical test theory, item difficulty, item discrimination, normative scores, and other aspects are all a function of the particular samples used in developing the test and generating norms; typically, a raw score is interpreted in terms of relative position within a sample, such as percentile rank or other transformation. IRT, on the other hand, focuses on a theoretical mathematical model that unites the characteristics of an item, such as item difficulty, to an underlying hypothesized dimension. Although the parameters of the theoretical model are estimated from a specific set of data, the computed item characteristics are not restricted to a specific sample. This means, in effect, that item pools can be created and then subsets of items selected to meet specific criteria – for example, a medium level of difficulty. Or subsets of items can be selected for specific examinees (for a readable review of IRT see Loyd, 1988).

Basically, then, IRT is concerned with the interplay of four aspects: (1) the ability of the individual on the variable being assessed, (2) the extent to which a test item discriminates between high- and low-scoring groups, (3) the difficulty of the item, and (4) the probability that a person of low ability on that variable makes the correct response.
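One common way of tying these four aspects together – though by no means the only IRT model – is the three-parameter logistic function, in which b is the item’s difficulty, a its discrimination, c the “floor” probability that a very low-ability person answers correctly, and theta the person’s ability; a sketch:

    import math

    def p_correct(theta, a, b, c):
        # Three-parameter logistic item characteristic curve: the probability
        # that a person of ability theta answers the item correctly.
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    # An item of average difficulty on a five-choice test:
    print(round(p_correct(0.0, a=1.0, b=0.0, c=0.20), 2))   # 0.6
    print(round(p_correct(-2.0, a=1.0, b=0.0, c=0.20), 2))  # 0.3 (low ability)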
NORMS

No matter what philosophical preferences we have, ultimately we are faced with a raw score obtained from a test, and we need to make sense of that score. As we have seen, we can change that raw score in a number of ways, but eventually we must be able to compare that score with those obtained for a normative sample, and so we need to take a closer look at norms.

How are norms selected? Commercial companies that publish tests (for a listing of these, consult the MMY) may have the financial and technical means to administer a test to large and representative groups of subjects in a variety of geographical settings. Depending on the purpose of the test, a test manual may present the scores of subjects listed separately for such variables as gender (males vs. females), school grade (e.g., fifth graders, sixth graders, etc.), time of testing (e.g., high-school seniors at the beginning of their senior year vs. high-school seniors near the end of the school year), educational level (high-school graduates, college graduates, etc.), geographical region (Northeast, Southwest, etc.), and other relevant variables or combinations of variables.

Sometimes the normative groups are formed on the basis of random sampling, and sometimes they are formed on the basis of certain criteria, for example, U.S. Census data. Thus if the census data indicate that the population is composed of different economic levels, we might wish to test a normative sample that reflects those specific percentages; this is called a stratified sample. More typically, especially with tests that are not commercially published, norms are made up of samples of convenience. An investigator developing a scale of leadership ability might get a sample of local business leaders to take the test, perhaps in return for a free lecture on “how to improve one’s leadership competence,” or might have a friend teaching at a graduate college of business agree to administer the test to entering students. Neither of these samples would be random, and one might argue neither would be representative. As the test finds continued use, a variety of samples would be tested by different investigators and norms would be accumulated, so that we could learn what average scores are to be expected from particular samples, and how different from each other specific samples might be. Often, despite the nonrandomness, we might find that groups do not differ all that much – that the leadership level exhibited by business people in Lincoln, Nebraska, is not all that different from that exhibited by their counterparts in San Francisco, Atlanta, or New York City.

Age norms. Often we wish to compare a person’s test score with the scores obtained by a normative group of the same age. This makes sense if the variable being assessed changes with age. When we are testing children, such age norms become very important because we expect, for example, the arithmetic knowledge of a 5-year-old to be different from that of a 9-year-old. With some variables, there may be changes occurring well within a short time span, so we might need age norms based on a difference of a few months or less. With adults, age norms are typically less important because we would not expect, for example, the average 50-year-old person to know more (or less) arithmetic than the average 40-year-old. On the other hand, if we are testing college students on a measure of “social support,” we would want to compare their raw scores with norms based on college students rather than on retired senior citizens.

School grade norms. At least in our culture, most children are found in school, and schooling is a major activity of their lives. So tests that assess school achievement in various fields, such as reading, social studies, etc., often have norms based on school grades. If we accept the theoretical model that a school year covers 10 months, and if we accept the fiction that learning occurs evenly during those 10 months, we can develop a test where each item is assigned a score based on these assumptions. For example, if our fifth-grade reading test is composed of 20 items, each item answered correctly could be given one-half month’s credit, so a child answering all items correctly would be given one school-year credit, a child answering 16 items correctly would be given eight months’ credit, and so on.

Unfortunately, this practice leads to some strange interpretations of test results. Consider Maria, a fourth grader, who took a reading comprehension test. She answered correctly all of the items at the fourth grade and below, so she receives a score of 4 years. In addition, however, she also answered correctly several items at the fifth-grade level, several items at the sixth-grade level, a few at the seventh-grade level, and a couple at the eighth-grade level. For all of these items, she receives an additional 2 years’ credit, so her final score is sixth school year. Most likely when her parents and her teacher see this score they will conclude incorrectly that Maria has the reading comprehension of a sixth grader, and that therefore she should be placed in the sixth grade, or at the very least in an accelerated reading group. In fact, Maria’s performance is typical. Despite our best efforts at identifying test items that are appropriate for a specific grade level, children will exhibit scatter, and rarely will their performance conform to our theoretical preconceptions. The test can still be very useful in identifying Maria’s strengths or weaknesses, and in providing an objective benchmark, but we need to be careful of our conclusions.

A related approach to developing grade-equivalent scores is to compute the median score for pupils tested at a particular point in time. Let’s say, for example, we assess eighth graders in their fourth month of school and find that their median score on the XYZ test of reading is 93. If a child is then administered the test and obtains a score of 93, that child is said to have a grade equivalent of 8.4. There is another problem with grade-equivalent scores, and that is that school grades do not form an interval scale, even though the school year is approximately equal in length for any pupil. Simply consider the fact that a second grader who is one year behind his classmates in reading is substantially more “retarded” than an eighth grader who is one year behind.

Expectancy tables. Norms can be presented in a variety of ways. We can simply list the mean and SD for a normative group, or we can place the data in a table showing the raw scores and their equivalent percentiles, T scores, etc. For example,
Expectancy tables. Norms can be presented in a variety of ways. We can simply list the mean and SD for a normative group, or we can place the data in a table showing the raw scores and their equivalent percentiles, T scores, etc. For example, Table 2.2 gives some normative information, such as you might find in a test manual.

Table 2–2. Equivalent percentiles

Raw score     Male    Female
47             99       97
46             98       95
45             98       93
44             97       90
43             96       86
42             94       81
etc.

If we are using test scores to predict a particular outcome, we can incorporate that relationship into our table, and the table then becomes an expectancy table, showing what can be expected of a person with a particular score. Suppose, for example, we administer a test of mechanical aptitude to 500 factory workers. After 6 months we obtain for each worker supervisors' ratings indicating the quality of work. This situation is illustrated in Table 2.3.

Table 2–3. Supervisors' ratings

Mechanical                    Above               Below
aptitude scores   Excellent   average   Average   average   Poor
150–159              51         38        16         0        1
140–149              42         23         8         3        0
130–139              20         14         7         2        1
120–129              16          9         3         0        0
110–119               0          2         4         7        8
100–109               1          0         3        12       16
90–99                 1          0         0        14       19
80–89                 2          1         2        23       23
70–79                 0          1         0        19       26
60–69                 1          0         0        30       31
Totals:             134         88        43       110      125

Note that there were 106 individuals who scored between 150 and 159. Of these 106, 51 received ratings of excellent and 38 of above average. Assuming these are the type of workers we wish to hire, we would expect a new applicant to the company who scores between 150 and 159 to have an 89/106, or 84%, chance to do well in that company. Note, on the other hand, that of the 62 individuals who scored between 60 and 69, only 1 achieved a rating of excellent, so that we would expect any new applicant with a score of 60–69 not to do well. In fact, we could calculate what score a person would need to obtain to be hired; such a score is called the cutoff score.

A few additional points follow about expectancy tables. Because we need to change the frequencies into percentages, a more useful expectancy table is one where the author has already done this for us. Second, decisions based on expectancy tables will not be foolproof. After all, one of the lowest scoring persons in our example turned out to be an excellent worker. An expectancy table is based on a sample that may have been representative at the time the data was collected, but may no longer be so. For example, our fictitious company might have gotten a reputation for providing excellent benefits, and so the applicant pool may be larger and more heterogeneous. Or the economy might have changed for the worse, so that candidates who never would have thought of doing manual labor are now applying for positions. To compute an expectancy table, we need to have the scores for both variables for a normative sample, and the two sets of scores must show some degree of correlation. Once those data are obtained, for any new candidate only the test score is needed to predict what the expected performance will be. Expectancy tables need not be restricted to two variables, but may incorporate more than one variable that is related to the predicted outcome.
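Reading an expectancy table amounts to turning row frequencies into conditional percentages. Here is a minimal sketch of that arithmetic; the dictionary below simply transcribes two rows of Table 2.3, and the function name is our own illustration rather than any standard routine.

# Two rows from Table 2.3: counts of supervisors' ratings per score band.
# Column order: excellent, above average, average, below average, poor.
ratings = {
    "150-159": [51, 38, 16, 0, 1],
    "60-69":   [1, 0, 0, 30, 31],
}

def chance_of_success(band):
    """Percent of workers in a band rated excellent or above average."""
    counts = ratings[band]
    return 100 * (counts[0] + counts[1]) / sum(counts)

print(round(chance_of_success("150-159")))  # 84 -- i.e., 89 out of 106
print(round(chance_of_success("60-69")))    # 2  -- a poor bet for hiring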
Relativity of norms. John, a high-school student, takes a test of mechanical aptitude and obtains a score of 107. When we compare his score with available norms, we might find that his score is at the 85th percentile when compared with the high-school sample reported in the test manual, that his score is at the 72nd percentile when compared with students at his own high school, and that his score is at the 29th percentile when compared with those applicants who have been admitted to the prestigious General Dynamics School of Automobile Training. Thus different normative groups give different meaning to a particular score, and we need to ask, "Which norm group is most meaningful?" Of course, that depends. If John is indeed aspiring to be admitted to the General Dynamics school, then that normative group is more meaningful than the more representative but "generic" sample cited in the test manual.

Local norms. There are many situations where local norms, data obtained from a local group of individuals, are more meaningful than any national norms that attempt to be representative. If decisions are to be made about an individual applicant to a particular college or a specific job, it might be better to have local norms; if career counseling is taking place, then national norms might be more useful. Local norms are desirable if we wish to compare a child's relative standing with other children in the same school or school district, and they can be especially useful when a particular district differs in language and culture from the national normative sample. How to develop local norms is described in some detail by Kamphaus and Lozano (1984), who give both general principles and a specific example.

Criterion-referenced testing. You might recall being examined for your driving license, either through a multiple-choice test and/or a driving test, and being told, "Congratulations, you've passed." That decision did not involve comparing your score or performance against some norms, but rather comparing your performance against a criterion, a decision rule that was either explicit (you must miss fewer than 6 items to pass) or implicit (the examiner's judgment that you were skillful enough to obtain a driver's license).

Glaser (1963) first introduced the term criterion-referenced testing, and since then the procedure has been widely applied, particularly in educational testing. The intent is to judge a person's performance on a test not on the basis of what others can do, but on the basis of some criterion. For example, we may define mental retardation not on the basis of a normative IQ, but on whether a child of age 5 can show mastery of specific tasks, such as buttoning her shirt or following specific directions. Or we may admit a child to preschool on the basis of whether the child is toilet trained. Or we may administer a test of Spanish vocabulary and require 80% correct to register testees for Advanced Spanish.

Clearly, we must first of all be able to specify the criterion. Toilet training, mastery of elementary arithmetic, and automobile driving can all be defined fairly objectively, and generally agreed-upon criteria can be more or less specified. But there are many variables, many areas of competency, where such criteria cannot be clearly specified.

Second, criteria are not usually arbitrary, but are based on real-life observation. Thus, we would not label a 5-year-old as mentally retarded if the child did not master calculus, because few if any children of that age show such mastery. We would, however, expect a 5-year-old to be able to button his shirt. But that observation is in fact based upon norms; so criterion-referenced decisions can be normative decisions, often with the norms not clearly specified.

Finally, we should point out that criterion-referenced and norm-referenced refer to how the scores or test results are interpreted, rather than to the tests themselves. So Rebecca's score of 19 can be interpreted through norms or by reference to a criterion.

Criterion-referenced testing has made a substantial impact, particularly in the field of educational testing. To a certain degree, it has forced test constructors to become more sensitive to the domain being assessed, to more clearly and concretely specify the components of that domain, and to focus more on the concept of mastery of a particular domain (Carver, 1974; Shaycoft, 1979).

The term mastery is often closely associated with criterion-referenced testing, although other terms are used. Carver (1974) used the terms psychometric to refer to norm referenced and edumetric to refer to criterion referenced. He argued that the psychometric approach focuses on individual differences, and that item selection and the assessment of reliability and validity are determined by statistical procedures. The edumetric
approach, on the other hand, focuses on the measurement of gain or growth of individuals, and item selection, reliability, and validity all center on the notion of gain or growth.

COMBINING TEST SCORES

Typically, a score that is obtained on a test is the result of the scoring of a set of items, with items contributing equal weight, for example 1 point each, or different weights (item #6 may be worth one point, but item #18 may be worth 3 points). Sometimes, scores from various subtests are combined into a composite score. For example, a test of intelligence such as the Wechsler Adult Intelligence Scale is composed of eleven subtests. Each of these subtests yields a score, and six of these scores are combined into a Verbal IQ, while the other five scores are combined into a Performance IQ. In addition, the Verbal IQ and the Performance IQ are combined into a Full Scale IQ. Finally, scores from different tests or sources of information may be combined into a single index. A college admissions officer may, for example, combine an applicant's GPA, scores on an entrance examination, and interview information into a single index to decide whether the applicant should be admitted. There are thus at least three basic ways of combining scores, and the procedures by which this is accomplished are highly similar (F. G. Brown, 1976).

Combining scores using statistics. Suppose we had administered ten different tests of "knowledge of Spanish" to Sharon. One test measured vocabulary; another, knowledge of verbs; still a third, familiarity with Spanish idioms; and so on. We are not only interested in each of these ten components, but we would like to combine Sharon's ten different scores into one index that reflects "knowledge of Spanish." If the ten tests were made up of one item each, we could of course simply sum up how many of the ten items were answered correctly by Sharon. With tests that are made up of differing numbers of items, we cannot calculate such a sum, since each test may have a different mean and standard deviation, that is, represent different scales of measurement. This would be very much like adding a person's weight in pounds to their height in inches and their blood pressure in millimeters to obtain an index of "physical functioning." Statistically, we must equate each separate measurement before we add them up. One easy way to do this is to change the raw scores into z scores or T scores. This would make all of Sharon's ten scores equivalent psychometrically, with each z score reflecting her performance on that variable (e.g., higher on vocabulary but lower on idioms). The ten z scores could then be added together, and perhaps divided by ten.

Note that we might well wish to argue, either on theoretical or empirical grounds, that each of the ten tests should not be given equal weight; for example, that the vocabulary test is most important and should therefore be weighted twice as much. Or if we were dealing with a scale of depression, we might argue that an item dealing with suicide ideation reflects more depression than an item dealing with feeling sad, and therefore should be counted more heavily in the total score. There are a number of techniques, both statistical and logical, by which differential weighting can be used, as opposed to unit weighting, where every component is given the same scoring weight (see Wang & Stanley, 1970). Under most conditions, unit weighting seems to be as valid as methods that attempt differential weighting (F. G. Brown, 1976).
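As a concrete illustration of unit weighting with z scores, consider the sketch below. The three subtests shown, their normative means and SDs, and Sharon's raw scores are all invented for the example; the logic extends unchanged to all ten tests.

# Invented normative means and SDs for three of the ten Spanish subtests,
# plus Sharon's raw score on each.
norms = {"vocabulary": (50, 10), "verbs": (20, 4), "idioms": (30, 6)}
sharon = {"vocabulary": 65, "verbs": 21, "idioms": 27}

# Convert each raw score to a z score so the subtests share one scale.
z_scores = {t: (sharon[t] - m) / sd for t, (m, sd) in norms.items()}

# Unit weighting: average the z scores into a single composite index.
composite = sum(z_scores.values()) / len(z_scores)
print(z_scores)   # higher on vocabulary (1.5), lower on idioms (-0.5)
print(round(composite, 2))

Differential weighting would simply multiply each z score by its weight before summing; with these data, unit weighting gives each subtest an equal voice.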
Combining scores using clinical intuition. In many applied situations, scores are combined not in a formal, statistical manner, but in an informal, intuitive, judgmental manner. A college admissions officer, for example, may consider an applicant's grades, letters of recommendation, test scores, autobiographical sketch, background variables such as high school attended, and so on, and combine all of these into a decision of "admit" or "reject." A personnel manager may review an applicant's file and decide, on the basis of a global evaluation, to hire the candidate. This process of "clinical intuition" and whether it is more or less valid than a statistical approach has been studied extensively (e.g., Goldberg, 1968; Holt, 1958; Meehl, 1954; 1956; 1957). Proponents of the intuitive method argue that because each person is unique, only clinical judgment can encompass that uniqueness, and that clinical judgment can take into account both complex and atypical patterns (the brilliant student who flunks high school but does extremely well in medical school). Proponents of the statistical approach argue that in the long run, better predictive accuracy is obtained through statistical procedures, and that "intuition" operates inefficiently, if at all.

Multiple cutoff scores. One way to statistically combine test scores to arrive at a decision is to use a multiple cutoff procedure. Let us assume we are an admissions officer at a particular college, looking at applications from prospective applicants. For each test or source of information we determine, either empirically or theoretically, a cutoff score that separates the range of scores into two categories, for example "accept" and "reject." Thus if we required our applicants to take an IQ test, we might consider an IQ of 120 as the minimum required for acceptance. If we also looked at high school GPA, we might require a minimum 86% overall for acceptance. These cutoff scores may be based on clinical judgment – "It is my opinion that students with an IQ less than 120 and high school GPA less than 86% do not do well here" – or on statistical evidence – a study of 200 incoming freshmen indicated that the flunk rate of those below the cutoff scores was 71% vs. 6% for those above the cutoff scores.

Note that using this system of multiple cutoffs, a candidate with an IQ of 200 but a GPA of 82% would not be admitted. Thus we need to ask whether superior performance on one variable can compensate for poor performance on another variable. The multiple cutoff procedure is a noncompensatory one and should be used only in such situations. For example, if we were selecting candidates for pilot training, where both intelligence and visual acuity are necessary, we would not accept a very bright but blind individual.

There are a number of variations to the basic multiple cutoff procedure. For example, the decision need not be a dichotomy. We could classify our applicants as accept, reject, accept on probation, and hold for personal interview. We can also obtain the information sequentially. We might, for example, first require a college entrance admission test. Those that score above the cutoff score on that test may be required to take a second test or other procedure and may then be admitted on the basis of the second cutoff score.
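Because the procedure is noncompensatory, it amounts to nothing more than a chain of minimum requirements. A minimal sketch, using the cutoffs from the admissions example above (the applicants and the function itself are invented for illustration):

# Cutoffs taken from the admissions example in the text.
CUTOFFS = {"iq": 120, "gpa": 86}

def decide(applicant):
    """Noncompensatory rule: failing any single cutoff means rejection."""
    for measure, minimum in CUTOFFS.items():
        if applicant[measure] < minimum:
            return "reject"
    return "accept"

# A very high IQ cannot compensate for a GPA below the cutoff.
print(decide({"iq": 200, "gpa": 82}))  # reject
print(decide({"iq": 121, "gpa": 90}))  # accept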
Multiple regression. Another way of combining scores statistically is through the use of multiple regression, which essentially expresses the relationship between a set of variables and a particular outcome that is being predicted. If we had only one variable, for example IQ, and are predicting GPA, we could express the relationship with a correlation coefficient, or with the equation of a straight line, namely:

Y = a + bX

where Y is the variable being predicted, in this case GPA; X is the variable we have measured, in this case IQ; b is the slope of the regression line (which tells us, as X increases, by how much Y increases); and a is the intercept (that is, it reflects the difference in scores between the two scales of measurement; in this case GPA is measured on a 4-point scale while IQ has a mean of 100).

When we have a number of variables, all related statistically to the outcome, then the equation expands to:

Y = a + b1X1 + b2X2 + b3X3 + . . . etc.

A nice example of a regression equation can be found in the work of Gough (1968) on a widely used personality test called the California Psychological Inventory (CPI). Gough administered the CPI to 60 airline stewardesses who had undergone flight training and had received ratings of in-flight performance (something like a final-exam grade). None of the 18 CPI scales individually correlated highly with such a rating, but a four-variable multiple regression not only correlated +.40 with the ratings of in-flight performance, but also yielded an interesting psychological portrait of the stewardesses. The equation was:

In-flight rating = 64.293 + .227(So) − 1.903(Cm) + 1.226(Ac) − .398(Ai)

where 64.293 is a weight that allows the two sides of the equation to be equated numerically; So is the person's score on the Socialization scale; Cm is the person's score on the Communality scale; Ac is the person's score on the Achievement by Conformance scale; and Ai is the person's score on the Achievement by Independence scale.

Notice that each of the four variables has a number and a sign (+ or −) associated with it. To predict a person's rating of in-flight performance we would plug in the scores on the four variables, multiply each score by the appropriate weight, and sum to solve the equation. Note that in this equation, Communality is given the greatest weight and Socialization the least, and that two scales are given positive weights (the higher the scores on the So and Ac scales, the higher the predicted in-flight ratings), and two scales are given negative weights (the higher the scores, the lower the predicted in-flight rating). By its very nature, a regression equation gives differential weighting to each of the variables.
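"Plugging in" the scores is easy to make concrete. In the sketch below, the intercept and weights are those of Gough's equation as reported above, while the candidate's four scale scores are invented for illustration; the function is merely our own wrapper around the arithmetic.

# Intercept and weights from Gough's (1968) four-variable equation.
INTERCEPT = 64.293
WEIGHTS = {"So": 0.227, "Cm": -1.903, "Ac": 1.226, "Ai": -0.398}

def predicted_inflight_rating(scores):
    """Multiply each CPI scale score by its weight and sum with the intercept."""
    return INTERCEPT + sum(WEIGHTS[s] * scores[s] for s in WEIGHTS)

candidate = {"So": 35, "Cm": 26, "Ac": 30, "Ai": 20}  # hypothetical scores
print(round(predicted_inflight_rating(candidate), 2))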
The statistics of multiple regression is a complex topic and will not be discussed here (see J. Cohen & P. Cohen, 1983; Kerlinger & Pedhazur, 1973; Pedhazur, 1982; Schroeder, Sjoquist, & Stephan, 1986), but there are a number of points that need to be mentioned. First of all, multiple regression is a compensatory model; that is, high scores on one variable can compensate for low scores on another variable. Second, it is a linear model; that is, it assumes that as scores increase on one variable (for example, IQ), scores will increase on the predicted variable (for example, GPA). Third, the variables that become part of the regression equation are those that have the highest correlations with the criterion and low correlations with the other variables in the equation. Note that in the CPI example above, there were 18 potential variables, but only 4 became part of the regression equation. Thus, additional variables will not become part of the equation, even if they correlate with the criterion, unless they add something unique – that is, unless they have low or zero correlations with the other variables. In most practical cases, regression equations are made up of about two to six variables. The variables that are selected for the equation are selected on the basis of statistical criteria, although their original inclusion in the study might have reflected clinical judgment.

Discriminant analysis. Another technique that is somewhat similar to multiple regression is that of discriminant analysis. In multiple regression, we place a person's scores in the equation, do the appropriate calculations, and out pops the person's predicted score on the variable of interest, such as GPA. In discriminant analysis we also use a set of variables, but this time we wish to predict group membership rather than a continuous score. Suppose, for example, that there are distinct personality differences between college students whose life centers on academic pursuits (the "geeks") vs. students whose life centers on social and extracurricular activities (the "greeks"). John has applied to our university and we wish to determine whether he is more likely to be a geek or a greek. That is the aim of discriminant analysis. Once we know that two or more groups differ significantly from each other on a set of variables, we can assess an individual to determine which group that person most closely resembles. Despite the frivolous nature of the example, discriminant analysis has the potential to be a powerful tool in psychiatric diagnosis, career counseling, suicide prevention, and other areas (Tatsuoka, 1970).

SUMMARY

In this chapter we have looked at three basic issues: the construction, the administration, and the interpretation of tests. Test construction involves a wide variety of procedures, but for our purposes we can use a nine-step model to understand the process. Test items come in all shapes and forms, though some, like multiple choice, seem to be more common. Test construction is not a mere mechanical procedure, but in part involves some basic philosophical issues. A primary issue in test administration is that of establishing rapport. Once the test is administered and scored, the raw scores need to be changed into derived scores, including percentiles, standard scores, T scores, or stanines. Two aspects of test items are of particular interest to test constructors: item difficulty and item discrimination. Finally, we need to interpret a raw score
in terms of available norms or a criterion. Scores can also be combined in a number of ways.

SUGGESTED READINGS

Dawis, R. V. (1987). Scale construction. Journal of Counseling Psychology, 34, 481–489.

This article discusses the design, development, and evaluation of scales for use in counseling psychology research. Most of the methods discussed in this article will be covered in later chapters, but some of the basic issues are quite relevant to this chapter.

Hase, H. D., & Goldberg, L. R. (1967). Comparative validity of different strategies of constructing personality inventory scales. Psychological Bulletin, 67, 231–248.

This is an old but still fascinating report. The authors identify six strategies by which personality inventory scales can be developed. From the same item pool, they constructed sets of 11 scales by each of the 6 strategies. They then compared these 66 scales with 13 criteria. Which set of scales, which type of strategy, was the best? To find the answer, check the report out!

Henderson, M., & Freeman, C. P. L. (1987). A self-rating scale for bulimia. The "BITE." British Journal of Psychiatry, 150, 18–24.

There is a lot of interest in eating disorders, and these authors report on the development of a 36-item scale composed of two subscales – the Symptom Subscale and the Severity scale – designed to measure binge eating. Like the study by Zimmerman and Coryell (1987) listed next, this study uses fairly typical procedures and reflects at least some of the steps mentioned in this chapter.

Nield, A. F. (1986). Multiple-choice questions with an option to comment: Student attitudes and use. Teaching of Psychology, 13, 196–199.

The author reports on a study where introductory psychology students were administered multiple-choice questions with an option to explain their answers. Such items were preferred by the students and found to be less frustrating and anxiety producing.

Zimmerman, M., & Coryell, W. (1987). The Inventory to Diagnose Depression (IDD): A self-report scale to diagnose major depressive disorder. Journal of Consulting and Clinical Psychology, 55, 55–59.

The authors report on the development of a 22-item self-report scale to diagnose depression. The procedures and methodologies used are fairly typical, and most of the article is readable even if the reader does not have a sophisticated statistical background.

DISCUSSION QUESTIONS

1. Locate a journal article that presents the development of a new scale (e.g., Leichsenring, 1999). How does the procedure compare and contrast with that discussed in the text?
2. Select a psychological variable that is of interest to you (e.g., intelligence, depression, computer anxiety, altruism, etc.). How might you develop a direct assessment of such a variable?
3. When your instructor administers an examination in this class, the results will most likely be reported as raw scores. Would derived scores be better?
4. What are the practical implications of changing item difficulty?
5. What kind of norms would be useful for a classroom test? For a test of intelligence? For a college entrance exam?
3 Reliability and Validity

AIM This chapter introduces the concepts of reliability and of validity as the two
basic properties that every measuring instrument must have. These two properties are
defined and the various subtypes of each discussed. The major focus is on a logical
understanding of the concepts, as well as an applied understanding through the use
of various statistical approaches.

INTRODUCTION

Every measuring instrument, whether it is a yardstick or an inventory of depression, must have two properties: the instrument must yield consistent measurement, i.e., must be reliable, and the instrument must in fact measure the variable it is said to measure, i.e., must be valid. These two properties, reliability and validity, are the focus of this chapter.

RELIABILITY

Imagine that you have a rich uncle who has just returned from a cruise to an exotic country, and he has brought you as a souvenir a small ruler – not a pygmy king, but a piece of wood with markings on it. Before you decide that your imaginary uncle is a tightwad, I should tell you that the ruler is made of an extremely rare wood with an interesting property – the wood shrinks and expands randomly – not according to humidity or temperature or day of the week, but randomly. If such a ruler existed it would be an interesting conversation piece, but as a measuring instrument it would be a miserable failure. Any measuring instrument must first of all yield consistent measurement; the actual measurement should not change unless what we are measuring changes. Consistency or reliability does not necessarily mean sameness. A radar gun that always indicates 80 miles per hour, even when it is pointed at a stationary tree, does not have reliability. Similarly, a bathroom scale that works accurately except for Wednesday mornings, when the weight recorded is arbitrarily increased by three pounds, does have reliability.

Note that reliability is not a property of a test, even though we speak of the results as if it were (for example, "the test-retest reliability of the Jones Depression Inventory is .83"). Reliability really refers to the consistency of the data or the results obtained. These results can and do vary from situation to situation. Perhaps an analogy might be useful. When you buy a new automobile, you are told that you will get 28 miles per gallon. But the actual mileage will be a function of how you drive, whether you are pulling a trailer or not, how many passengers there are, whether the engine is well tuned, etc. Thus the actual mileage will be a "result" that can change as aspects of the situation change (even though we would ordinarily not expect extreme changes – even the most careful driver will not be able to increase gas mileage to 100 miles per gallon) (see Thompson & Vacha-Haase, 2000).

True vs. error variation. What then is reliability? Consider 100 individuals of different heights.
When we measure these heights we will find variation, statistically measured by variance (the square of the standard deviation). Most of the variation will be "true" variation – that is, people really differ from each other in their heights. Part of the variation however, will be "error" variation, perhaps due to the carelessness of the person doing the measuring, or a momentary slouching of the person being measured, or how long the person has been standing up as opposed to lying down, and so on. Note that some of the error variation can be eliminated, and what is considered error variation in one circumstance may be a legitimate focus of study in another. For example, we may be very interested in the amount of "shrinkage" of the human body that occurs as a function of standing up for hours.

How is reliability determined? There are basically four ways: test-retest reliability, alternate (or equivalent) forms reliability, split-half reliability, and interitem consistency.

TYPES OF RELIABILITY

Test-retest reliability. You have probably experienced something like this: you take out your purse or wallet, count your money, and place the wallet back. Then you realize that something is not quite right, take the wallet out again, and recount your money to see if you obtain the same result. In fact, you were determining test-retest reliability. Essentially then, test-retest reliability involves administering a test to a group of individuals and retesting them after a suitable interval. We now have two sets of scores for the same persons, and we compare the consistency of these two sets, typically by computing a correlation coefficient. You will recall that the most common type of correlation coefficient is the Pearson product moment correlation coefficient, typically abbreviated as r, used when the two sets of scores are continuous and normally distributed (at least theoretically). There are other correlation coefficients used with different kinds of data, and these are briefly defined and illustrated in most introductory statistics books.

You will also recall that correlation coefficients can vary from zero, meaning that there is no relationship between one set of scores and the second set, to a plus or minus 1.00, meaning that there is a perfect relationship between one set of scores and the second. By convention, a correlation coefficient that reflects reliability should reach the value of .70 or above for the test to be considered reliable.

The determination of test-retest reliability appears quite simple and straightforward, but there are many problems associated with it. The first has to do with the "suitable" interval before retesting. If the interval is too short, for example a couple of hours, we may obtain substantial consistency of scores, but that may be more reflective of the relative consistency of people's memories over a short interval than of the actual measurement device. If the interval is quite long, for example a couple of years, then people may have actually changed from the first testing to the second testing. If everyone in our sample had changed by the same amount, for example had grown 3 inches, that would be no problem, since the consistency (John is still taller than Bill) would remain. But of course, people don't change in just about anything by the same amount, so there would be inconsistency between the first and second set of scores, and our instrument would appear to be unreliable whereas in fact it might be keeping track of such changes. Typically, changes over a relatively longer period of time are not considered in the context of reliability, but are seen as "true" changes.

Usually then, test-retest reliability is assessed over a short period of time (a few days to a few weeks or a few months), and the obtained correlation coefficient is accompanied by a description of what the time period was. In effect, test-retest reliability can be considered a measure of the stability of scores over time. Different periods of time may yield different estimates of stability. Note also that some variables, by their very nature, are more stable than others. We would not expect the heights of college students to change over a two-week period, but we would expect changes in mood, even within an hour!

Another problem is related to motivation. Taking a personality inventory might be interesting to most people, but taking it later a second time might not be so exciting. Some people in fact might become so bored or resentful as to perhaps answer randomly or carelessly the second time around. Again, since not everyone would become careless to the same degree, retest scores would change differently for different people,
and therefore the proportion of error variation to true variation would become larger; hence the size of the correlation coefficient would be smaller.

There are a number of other problems with test-retest reliability. If the test measures some skill, the first administration may be perceived as a "practice" run for the second administration, but again not everyone will improve to the same degree on the second administration. If the test involves factual knowledge, such as vocabulary, some individuals might look up some words in the dictionary after the first administration and thus change their scores on the second administration, even if they didn't expect a retesting.
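In practice, computing a test-retest coefficient is as simple as correlating the two administrations. A minimal sketch, with invented scores for five examinees (statistics.correlation requires Python 3.10 or later):

from statistics import correlation

# Invented scores for the same five people tested twice, two weeks apart.
first  = [12, 18, 25, 31, 40]
second = [14, 17, 27, 30, 41]

# The test-retest reliability estimate is the Pearson r between the sets.
print(round(correlation(first, second), 2))  # about .99 for these data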
Alternate form reliability. A second way to measure reliability is to develop two forms of the same test, and to administer the two forms either at different times or in succession. Good experimental practice requires that, to eliminate any practice or transfer effects, half of the subjects take form A followed by form B, and half take form B followed by form A. The two forms should be equivalent in all aspects – instructions, number of items, etc. – except that the items are different. This approach would do away with some of the problems mentioned above with test-retest reliability, but would not eliminate all of them.

If the two forms of the test are administered in rapid succession, any score differences from the first to the second form for a particular individual would be due to the item content, and thus reliability could be lowered due to item sampling, that is, the fact that our measurement involves two different samples of items, even though they are supposed to be equivalent. If the two forms are administered with some time interval between them, then our reliability coefficient will reflect the variation due to both item sampling and temporal aspects.

Although it is desirable to have alternate forms of the same test to reduce cheating, to assess the effectiveness of some experimental treatment, or to maintain the security of a test (as in the case of the GRE), the major problem with alternate form reliability is that the development of an alternate form can be extremely time consuming and sometimes simply impossible to do, particularly for tests that are not commercially published. If we are developing a test to measure knowledge of arithmetic in children, there is almost an infinite number of items we can generate for an alternate form, but if we are developing a test to assess depression, the number of available items related to depression is substantially smaller.

Let's assume you have developed a 100-item, multiple-choice vocabulary test composed of items such as:

donkey = (a) feline, (b) canine, (c) aquiline, (d) asinine

You have worked for five years on the project, tried out many items, and eliminated those that were too easy or too difficult, those that showed gender differences, those that reflected a person's college major, and so on. You now have 100 items that do not show such undue influences and are told that you must show that your vocabulary test is indeed reliable. Test-retest reliability does not seem appropriate for the reasons discussed above. In effect, you must go back and spend another 5 years developing an alternate form. Even if you were willing to do so, you might find that there just are not another 100 items that are equivalent. Is there a way out? Yes, indeed there is; that is the third method of assessing reliability, known as split-half reliability.

Split-half reliability. We can administer the 100-item vocabulary test to a group of subjects, and then for each person obtain two scores: the number correct on even-numbered items and the number correct on odd-numbered items. We can then correlate the two sets of scores. In effect, we have done something that is not very different from alternate-form reliability; we are making believe that the 100-item test is really two 50-item tests. The reliability estimate we compute will be affected by item sampling – the odd-numbered items are different from the even-numbered items – but will not be affected by temporal stability, because only one administration is involved.

There is, however, an important yet subtle difference between split-half reliability and test-retest. In test-retest, reliability was really a reflection of temporal stability; if what was being measured did not appreciably change over time, then our measurement was deemed consistent or reliable. In split-half reliability the focus of consistency has changed. We are no longer concerned about temporal stability, but are now concerned with internal consistency. Split-half reliability makes sense to the degree that each item in
our vocabulary test measures the same variable, that is, to the degree that a test is composed of homogeneous items. Consider a test to measure arithmetic where the odd-numbered items are multiplication items and the even-numbered items deal with algebraic functions. There may not be a substantial relationship between these two areas of arithmetic knowledge, and a computed correlation coefficient between scores on the two halves might be low. This case should not necessarily be taken as evidence that our test is unreliable, but rather that the split-half procedure is applicable only to homogeneous tests. A number of psychologists argue that indeed most tests should be homogeneous, but other psychologists prefer to judge tests on the basis of how well they work rather than on whether they are homogeneous or heterogeneous in composition. In psychological measurement, it is often difficult to assess whether the items that make up a scale of depression, or anxiety, or self-esteem are psychometrically consistent with each other or reflect different facets of what are rather complex and multidimensional phenomena.

There are of course many ways to split a test in half to generate two scores per subject. For our 100-item vocabulary test, we could score the first 50 items and the second 50 items. Such a split would ordinarily not be a good procedure, because people tend to get more tired toward the end of a test and thus would be likely to make more errors on the second half. Also, items are often presented within a test in order of difficulty, with easy items first and difficult items later; this might result in almost everyone getting higher scores on the first half of the test and differing on the second half – a state of affairs that would result in a rather low correlation coefficient. You can probably think of more complicated ways to split a test in half, but the odd vs. even method usually works well. In fact, split-half reliability is often referred to as odd-even reliability.

Each half score represents a sample, but the computed reliability is based only on half of the items in the test, because we are in effect comparing 50 items vs. 50 items, rather than 100 items. Yet from the viewpoint of item sampling (not temporal stability), the longer the test, the higher its reliability will be (Cureton, 1965; Cureton et al., 1973). All other things being equal, a 100-item test will be more reliable than a 50-item test – going to a restaurant 10 different times will give you a more "stable" idea of what the chef can do than only two visits. There is a formula that allows us to estimate the reliability of the entire test from a split-half administration, and it is called the Spearman-Brown formula:

estimated r = k (obtained r) / [1 + (k − 1)(obtained r)]

In the formula, k is the number of times the test is lengthened or shortened. Thus, in split-half reliability, k becomes 2 because we want to know the reliability of the entire test, a test that is twice as long as one of its halves. But the Spearman-Brown formula can be used to answer other questions, as these examples indicate:

EXAMPLE 1 I have a 100-item test whose split-half reliability is .68. What is the reliability of the total test?

estimated r = 2(.68) / [1 + (1)(.68)] = 1.36 / 1.68 = .81

EXAMPLE 2 I have a 60-item test whose reliability is .61; how long must the test be for its reliability to be .70? (Notice we need to solve for k.)

.70 = k(.61) / [1 + (k − 1)(.61)]

Cross-multiplying we obtain:

k(.61) = .70 + .70(k − 1)(.61)
k(.61) = .70 + (.427)(k − 1)
k(.61) = .70 + .427k − .427
k(.183) = .273
k = 1.49

The test needs to be about 1.5 times as long, or about 90 items (60 × 1.5).

EXAMPLE 3 Given a 300-item test whose reliability is .96, how short can the test be to have its reliability be at least .70? (Again, we are solving for k.)

.70 = k(.96) / [1 + (k − 1)(.96)]
k(.96) = .70 + .70(.96)(k − 1)
k(.96) = .70 + .672(k − 1)
k(.96) = .70 + .672k − .672
k(.96) − .672k = .028
k(.288) = .028
k = .097

The test can be about one tenth of this length, or about 30 items long (300 × .097).
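Both uses of the formula are one-liners in code. In the sketch below, length_factor solves the Spearman-Brown formula for k algebraically, k = [r_desired (1 − r_obtained)] / [r_obtained (1 − r_desired)], which reproduces the step-by-step algebra of Examples 2 and 3; the function names are our own.

def spearman_brown(r_obtained, k):
    """Estimated reliability when a test is lengthened (or shortened) k times."""
    return k * r_obtained / (1 + (k - 1) * r_obtained)

def length_factor(r_obtained, r_desired):
    """Solve the formula for k: how many times longer must the test become?"""
    return r_desired * (1 - r_obtained) / (r_obtained * (1 - r_desired))

print(round(spearman_brown(0.68, 2), 2))    # Example 1: 0.81
print(round(length_factor(0.61, 0.70), 2))  # Example 2: 1.49, about 90 items
print(round(length_factor(0.96, 0.70), 3))  # Example 3: 0.097, about 30 items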
The calculations with the Spearman-Brown formula assume that when a test is shortened or lengthened, the items that are eliminated or added are all equal in reliability. In fact such is not the case, and it is quite possible to increase the reliability of a test by eliminating the least reliable items. In this context, note that reliability can be applied to an entire test or to each item.

The Rulon formula. Although the Spearman-Brown formula is probably the most often cited and used method to compute the reliability of the entire test, other equivalent methods have been devised (e.g., Guttman, 1945; Mosier, 1941; Rulon, 1939). The Rulon formula is:

estimated r = 1 − (variance of differences / variance of total scores)

For each person who has taken our test, we generate four scores: the score on the odd items, the score on the even items, a difference score (score on the odd items minus score on the even items), and a total score (odd plus even). We then compute the variance of the difference scores and the variance of the total scores to plug into the formula. Note that if the scores on the two halves were perfectly consistent, there would be no variation between the odd item score and the even item score, and so the variance of the difference scores would be zero, and therefore the estimated r would equal 1. The ratio of the two variances in fact reflects the proportion of error variance that, when subtracted from 1, leaves the proportion of "true" variance, that is, the reliability.
Variability. As discussed in Chapter 2, variability of scores among individuals, that is, individual differences, makes statistical calculations such as the correlation coefficient possible. The item, "Are you alive as you read this?" is not a good test item because it would yield no variability – everyone presumably would give the same answer. Similarly, gender as defined by "male" or "female" yields relatively little variability, and from a psychometric point of view, gender thus defined is not a very useful measure. All other things being equal, the greater the variability in test scores, the better off we are. One way to obtain such variability is to increase the range of responses. For example, instead of just asking do you agree or disagree, we could use a five-point response scale of strongly agree, agree, undecided, disagree, strongly disagree. Another way to increase variability is to increase the number of items – a 10-item true-false scale can theoretically yield scores from 0 to 10, but a 25-item scale can yield scores from 0 to 25, and that of course is precisely the message of the Spearman-Brown formula. Still another way to increase variability is to develop test items that are neither too easy nor too difficult for the intended consumer, as we also discussed in Chapter 2. A test that is too easy would result in too many identical high scores, and a test that is too difficult would result in too many identical low scores. In either case, variability, and therefore reliability, would suffer.

Two halves = four quarters. If you followed the discussion up to now, you probably saw no logical fallacy in taking a 100-item vocabulary test and generating two scores for each person, as if in fact you had two 50-item tests. And indeed there is none. Could we not argue however, that in fact we have 4 tests of 25 items each, and thus we could generate four scores for each subject? After all, if we can cut a pie in two, why not in four? Indeed, why not argue that we have 10 tests of 10 items each, or 25 tests of 4 items each, or 100 tests of 1 item each! This leads us to the fourth way of determining reliability, known as interitem consistency.

Interitem consistency. This approach assumes that each item in a test is in fact a measure of the same variable, whatever that may be, and that we can assess the reliability of the test by assessing the consistency among items. This approach rests on two assumptions that are often not recognized even by test "experts." The first is that interitem reliability, like split-half reliability, is applicable and meaningful only to the extent that a test is made up of homogeneous items, items that all assess the same domain. The key word of course is "same." What constitutes the same domain? You have or will be taking an examination in this course, most likely made up of multiple-choice items. All of the items focus on your knowledge of psychological testing, but some of the items may require rote memory, others, recognition of key words, still others, the ability to reason logically,
and others, perhaps the application of formulas. Do these items represent the same or different domains? We can partially answer this statistically, through factor analysis. But if we compute an interitem consistency reliability correlation coefficient, and the resulting r is below .70, we should not necessarily conclude that the test is unreliable.

A second assumption that lurks beneath interitem consistency is the notion that if each item were perfectly reliable, we would only obtain two test scores. For example, in our 100-item vocabulary test, you would either know the meaning of a word or you would not. If all the items are perfectly consistent, they would be perfectly related to each other, so that people taking the test would either get a perfect score or a zero. If that is the case, we would then only need 1 item rather than 100 items. In fact, in the real world items are not perfectly reliable or consistent with each other, and the result is individual differences and variability in scores. In the real world also, people do not have perfect vocabulary or no vocabulary, but differing amounts of vocabulary.

Measuring interitem consistency. How is interitem consistency measured? There are two formulas commonly used. The first is the Kuder-Richardson formula 20, sometimes abbreviated as K-R 20 (Kuder & Richardson, 1937), which is applicable to tests whose items can be scored on a dichotomous (e.g., right-wrong; true-false; yes-no) basis. The second formula is the coefficient alpha, also known as Cronbach's alpha (Cronbach, 1951), for tests whose items have responses that may be given different weights – for example, an attitude scale where the response "never" might be given 5 points, "occasionally" 4 points, etc. Both of these formulas require the data from only one administration of the test and both yield a correlation coefficient. It is sometimes recommended that Cronbach's alpha be at least .80 for a measure to be considered reliable (Carmines & Zeller, 1979). However, alpha increases as the number of items increases (and also increases as the correlations among items increase), so that .80 may be too harsh a criterion for shorter scales. (For an in-depth discussion of coefficient alpha, see Cortina, 1993.)
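Coefficient alpha is straightforward to compute from a person-by-item score matrix, using the standard formula alpha = [k / (k − 1)] × (1 − Σ item variances / total-score variance). The sketch below uses invented 0/1 item scores; with dichotomous items like these, the identical arithmetic is what K-R 20 prescribes.

from statistics import pvariance

# Invented item scores: rows are persons, columns are five test items.
scores = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 0],
]

n_items = len(scores[0])
item_vars = [pvariance([person[i] for person in scores]) for i in range(n_items)]
total_var = pvariance([sum(person) for person in scores])

# Cronbach's alpha; with 0/1 items like these it equals K-R 20.
alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # 0.55 for this tiny invented data set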
Sources of error. The four types of reliability just discussed all stem from the notion that a test score is composed of a "true" score plus an "error" component, and that reliability reflects the relative ratio of true score variance to total or observed score variance; if reliability were perfect, the error component would be zero.

A second approach to reliability is based on generalizability theory, which does not assume that a person has a "true" score on intelligence, or that error is basically of one kind, but argues that different conditions may result in different scores, and that error may reflect a variety of sources (Brennan, 1983; Cronbach, Gleser, Rajaratnam, & Nanda, 1972; see Lane, Ankenmann, & Stone, 1996, for an example of generalizability theory as applied to a mathematics test). The interest here is not only in obtaining information about the sources of error, but in systematically varying those sources and studying error experimentally. Lyman (1978) suggested five major sources of error for test scores:

1. The individual taking the test. Some individuals are more motivated than others, some are less attentive, some are more anxious, etc.
2. The influence of the examiner, especially on tests that are administered to one individual at a time. Some of these aspects might be whether the examiner is of the same race, gender, etc., as the client, and whether the examiner is (or is seen as) caring, authoritarian, etc.
3. The test items themselves. Different items elicit different responses.
4. Temporal consistency. For example, intelligence is fairly stable over time, but mood may not be.
5. Situational aspects. For example, noise in the hallway might distract a person taking a test.

We can experimentally study these sources of variation and statistically measure their impact, through such procedures as analysis of variance, to determine which variables and conditions lessen reliability. For example, whether the retest is 2 weeks later or 2 months later might result in substantial score differences on test X, but whether the administrator is male or female might result in significant variation in test scores for male subjects but not for female subjects. (See Brennan, 1983, or Shavelson, Webb, & Rowley, 1989, for a very readable overview of generalizability theory.)

Scorer reliability. Many tests can be scored in a straightforward manner: the answer is either correct or not, or specific weights are associated with specific responses, so that scoring is primarily a clerical matter. Some tests however, are fairly subjective in their scoring and require considerable judgment on the part of the scorer. Consider for example, essay tests that you might have taken in college courses. What constitutes an "A" response vs. a "B" or a "C" can be fairly arbitrary. Such tests require that they be reliable not only from one or more of the standpoints we have considered above, but also from the viewpoint of scorer reliability – would two different scorers arrive at the same score when scoring the same test protocol? The question is answered empirically; a set of test protocols is independently given to two or more scorers, and the resulting two or more sets of scores are compared, usually with a correlation coefficient, or sometimes by indicating the percentage of agreement (e.g., Fleiss, 1975).

Quite often, the scorers need to be trained to score the protocols, especially with sophisticated psychological techniques such as the Rorschach inkblot test, and the resulting correlation coefficient can be in part reflective of the effectiveness of the training. Note that, at least theoretically, an objectively scored test could have a very high reliability, but a subjectively scored version of the same test would be limited by the scorer reliability (for example, our 100-item vocabulary test could be changed so that subjects are asked to define each word, and their definitions would be judged as correct or not). Thus, one way to improve reliability is to use test items that can be objectively scored, and that is one of several reasons why psychometricians prefer multiple-choice items to formats such as essays.

Rater reliability. Scorer reliability is also referred to as rater reliability when we are dealing with ratings. For example, suppose that two faculty members independently read 80 applications to their graduate program and rate each application as "accept," "deny," or "get more information." Would the two faculty members agree with each other to any degree?

Chance. One of the considerations associated with scorer or rater reliability is chance. Imagine two raters observing a videotape of a therapy session, and rating the occurrence of every behavior that is reflective of anxiety. By chance alone, the observers could agree 50% of the time, so our reliability coefficient needs to take this into account: what is the actual degree of agreement over and above that due to chance? Several statistical measures have been proposed, but the one that is used most often is the Kappa coefficient developed by Cohen (1960; see also Hartmann, 1977). We could of course have more than two raters. For example, each application to a graduate program might be independently rated by three faculty members, but not all applications would be rated by the same three faculty. Procedures to measure rater reliability under these conditions are available (e.g., Fleiss, 1971).

Interobserver reliability. At the simplest level, we have two observers independently observing an event – e.g., did Brian hit Marla? Schematically, we can describe this situation as:

                         Observer 2
                         Yes      No
Observer 1      Yes       A        B
                No        C        D

Cells A and D represent agreements, and cells B and C represent disagreements. From this simple schema some 17 different ways of measuring observer reliability have been developed, although most are fairly equivalent (A. E. House, B. J. House, & Campbell, 1981). For example, we can compute percentage agreement as:

Percentage agreement = [(A + D) / (A + B + C + D)] × 100
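Percentage agreement and the Kappa coefficient mentioned above both fall out of the four cell counts. A minimal sketch with invented counts; the Kappa computation anticipates the formula given formally in the next paragraph, with the chance agreement Pe derived from each observer's marginal "yes" and "no" rates.

# Invented cell counts from the two-observer table; A and D are agreements.
A, B, C, D = 40, 10, 6, 44
total = A + B + C + D

percent_agreement = (A + D) / total * 100

# Observed agreement (Po) versus agreement expected by chance (Pe).
po = (A + D) / total
pe = ((A + B) * (A + C) + (C + D) * (B + D)) / total**2
kappa = (po - pe) / (1 - pe)

print(round(percent_agreement, 1))  # 84.0
print(round(kappa, 2))              # 0.68 -- agreement beyond chance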
From the same schema we can also compute coefficient Kappa, which is defined as:

Kappa = (Po − Pe) / (1 − Pe)

where Po is the observed proportion of agreement and Pe is the expected or chance agreement. To calculate Kappa, see Fleiss (1971) or Shrout, Spitzer, and Fleiss (1987).
Correction for attenuation. Reliability that is less than perfect, as it typically is, means that there is "noise in the system," much like static on a telephone line. But just as there are electronic means to remove that static, there are statistical means by which we can estimate what would happen if we had a perfectly reliable test. That procedure is called correction for attenuation and the formula is:

estimated r = r12 / √(r11 r22)

where estimated r is the "true" correlation between two measures if both the test and the second measure were perfectly reliable; r12 is the observed correlation between the test and the second measure; r11 is the reliability of the test; and r22 is the reliability of the second measure.

For example, assume there is a correlation between the Smith scholastic aptitude test and grades of .40; the reliability of the Smith is .90 and that of grades is .80. The estimated true correlation between the Smith test and GPA is:

.40 / √((.90)(.80)) = .40 / .85 = .47

You might wonder how the reliability of GPA might be established. Ordinarily of course, we would have to assume that grades are measured without error, because we cannot give grades twice or compare grades in the first three courses one takes vs. the last three courses in a semester. In that case, we would assign a 1 to r22, and so the formula would simplify to:

estimated r = r12 / √r11
P1: JZP
0521861810c03 CB1038/Domino 0 521 86181 0 March 4, 2006 14:16

50 Part One. Basic Issues

110 Susan's known IQ score.

34% 34%

13% 13%

3% 3%
−2 SD −1 SD x +1 SD +2 SD

the mean - 2 SEM 100.6 105.3 110 114.7 119.4 the mean + 2 SEM

the mean - 1 SEM the mean + 1 SEM


FIGURE 3–1. Hypothetical distribution of Susan’s IQ scores.

example), it is given a special name: the standard fore, we can assume that the probability of Susan’s
error of measurement or SEM (remember that “true” IQ being between 105.3 and 114.7 is 68%,
the word standard really means average). This and that the probability of her “true” IQ being
SEM is really a standard deviation: it tells us how between 100.6 and 119.4 is 94%. Note that as the
variable Susan’s scores are. SD of scores is smaller and the reliability coeffi-
In real life of course, I can only test Susan once cient is higher, the SEM is smaller. For example,
or twice, and I don’t know whether the obtained with an SD of 5, the
IQ is near her “true” IQ or is one of the extreme 
values. I can however, compute an estimate of the SEM = 5 (1−.90) = 1.58
SEM by using the formula: with an SD of 5 and a reliability coefficient of .96
√ the
SEM = SD 1 − r 11 
where SD is the standard deviation of scores on SEM = 5 (1−.96) = 1.
the test, and Don’t let the statistical calculations make you
r11 is the reliability coefficient. lose sight of the logic. When we administer a test
Let’s say that for the test I am using with Susan, there is “noise in the system” that we call error
the test manual indicates that the SD = 15 and the or lack of perfect reliability. Because of this, an
reliability coefficient is .90. The SEM is therefore obtained score of 120 could actually be a 119 or
equal to: a 122, or a 116 or a 125. Ordinarily we don’t
 expect that much noise in the system (to say that
15 (1−.90) or 4.7 Susan’s IQ could be anywhere between 10 and 300
How do we use this information? Remember that is not very useful) but in fact, most of the time,
a basic assumption of statistics is that scores, at the limits of a particular score are relatively close
least theoretically, take on a normal curve distri- together and are estimated by the SEM, which
bution. We can then imagine Susan’s score dis- reflects the reliability of a test as applied to a par-
tribution (the 5,000 IQs if we had them) to look ticular individual.
like the graph in Figure 3.1.
We only have one score, her IQ of 110, and The SE of differences. Suppose we gave Alicia
we calculated that her scores would on the aver- a test of arithmetic and a test of spelling. Let’s
age deviate by 4.7 (the size of the SEM). There- assume that both tests yield scores on the same
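The two formulas just discussed are easy to script. The following Python sketch (ours, not the authors’; the function names are our own) reproduces the Smith-test and Susan examples:

    from math import sqrt

    def corrected_for_attenuation(r12, r11, r22=1.0):
        # Estimated "true" correlation if both measures were perfectly
        # reliable; leaving r22 at 1.0 gives the simplified form used
        # when the criterion is assumed to be measured without error.
        return r12 / sqrt(r11 * r22)

    def standard_error_of_measurement(sd, r11):
        # SEM = SD * sqrt(1 - r11)
        return sd * sqrt(1 - r11)

    print(round(corrected_for_attenuation(.40, .90, .80), 2))  # 0.47
    sem = standard_error_of_measurement(15, .90)
    print(round(sem, 1))                                       # 4.7
    # Susan's obtained IQ of 110, bracketed by 1 and 2 SEMs:
    print(round(110 - sem, 1), round(110 + sem, 1))            # 105.3 114.7
    print(round(110 - 2 * sem, 1), round(110 + 2 * sem, 1))    # the 94% band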
The SE of differences. Suppose we gave Alicia a test of arithmetic and a test of spelling. Let’s assume that both tests yield scores on the same numerical scale – for example, an average of 100 and a SD of 10 – and that Alicia obtains a score of 108 on arithmetic and 112 on spelling. Can we conclude that she did better on the spelling test? Because there is “noise” (that is, unreliability) on both tests, that 108 on arithmetic could be 110, and that 112 on spelling could be 109, in which case we would not conclude that she did better on spelling. How can we compare her two scores from a reliability framework? The answer again lies in the standard error, this time called the standard error of differences, SED. Don’t lose sight of the fact that the SE is really a SD telling us by how much the scores deviate on the average. The formula for the SED is:

SED = √((SEM1)² + (SEM2)²)

which turns out to be equal to

SED = SD √(2 − r11 − r22)

where the first SEM and the first r refer to the first test, the second SEM and the second r refer to the second test, and SD = the standard deviation (which is the same for both tests).

Suppose for example, that the two tests Alicia took both have a SD of 10, and the reliability of the arithmetic test is .95 and that of the spelling test is .88. The SED would equal:

SED = 10 √(2 − .95 − .88) or 4.1

We would accept Alicia’s two scores as being different, if the probability of getting such a difference by chance alone is 5 or fewer times out of 100, i.e., p < .05. You will recall that such a probability can be mapped out on the normal curve to yield a z score of +1.96. We would therefore take the SED of 4.1 and multiply it by 1.96 to yield approximately 8, and would conclude that Alicia’s two scores are different only if they differ by at least 8 points; in the example above they do not, and therefore we cannot conclude that she did better on one test than the other (see Figure 3.2).

FIGURE 3–2. Normal curve distribution. The marked line divides the extreme 5% of the area from the other 95%; if our result is “extreme,” that is, falls in that 5% area, we decide that the two scores do indeed differ from each other.
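As a quick check on Alicia’s numbers, here is the SED computation in Python (our sketch; it assumes, as in the text, that both tests share the same SD):

    from math import sqrt

    def standard_error_of_differences(sd, r11, r22):
        # SED = SD * sqrt(2 - r11 - r22); equivalent to
        # sqrt(SEM1**2 + SEM2**2) when the two tests have the same SD.
        return sd * sqrt(2 - r11 - r22)

    sed = standard_error_of_differences(10, .95, .88)
    print(round(sed, 1))                 # 4.1
    print(round(1.96 * sed))             # about 8: the difference needed at p < .05
    print(abs(112 - 108) > 1.96 * sed)   # False: 4 points is not enough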
Reliability of difference scores. Note that in the above section we focused on the difference between two scores. Quite often the clinical psychologist, the school or educational psychologist, or even a researcher, might be more interested in the relationship of pairs of scores rather than individual single scores; we might for example be interested in relating discrepancies between verbal and nonverbal intelligence to evidence of possible brain damage, and so we must inquire into the reliability of difference scores. Such reliability is not the sum of the reliability of the two scores taken separately because the difference score is not only affected by the errors of measurement of each test, but is also distinguished by the fact that whatever is common to both measures is canceled out in the difference score – after all, we are looking at the difference. Thus the formula for the reliability of difference scores is:

r difference = (1/2 (r11 + r22) − r12) / (1 − r12)

For example, if the reliability of test A is .75 and that of test B is .90, and the correlation between the two tests is .50 then

r difference = (1/2 (.75 + .90) − .50) / (1 − .50) = .325 / .50 = .65

In general, when the correlation between two tests begins to approach the average of their separate reliability coefficients, the reliability of the difference score lowers rapidly. For example, if the reliability of test A is .70, that of test B is also .70, and the correlation between the two tests is .65, then

r difference = (1/2 (.70 + .70) − .65) / (1 − .65) = .05 / .35 = .14

The point here is that we need to be very careful when we make decisions based on difference scores. We should also reiterate that to compare the difference between two scores from two different tests, we need to make sure that the two scores are on the same scale of measurement; if they are not, we can of course change them to z scores, T scores, or some other scale.
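A one-function Python sketch (ours) reproduces both worked examples of the difference-score formula:

    def difference_score_reliability(r11, r22, r12):
        # r_difference = (1/2 (r11 + r22) - r12) / (1 - r12)
        return (0.5 * (r11 + r22) - r12) / (1 - r12)

    print(round(difference_score_reliability(.75, .90, .50), 2))  # 0.65
    print(round(difference_score_reliability(.70, .70, .65), 2))  # 0.14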
in this particular situation, with these particu-
Special circumstances. There are at least two cat- lar subjects. A test of academic aptitude may be
egories of tests where the determination of relia- predictive of performance at a large state uni-
bility requires somewhat more careful thinking. versity but not at a community college. From a
The first of these are speeded tests where differ- classical point of view, there are three major cat-
ent scores reflect different rates of responding. egories of validity, and these are called content
Consider for example a page of text where the validity, criterion validity, and construct valid-
task is to cross out all the letters “e” with a time ity. The division of validity into various parts has
limit of 40 seconds. A person’s score will simply been objected to by many (e.g., Cronbach, 1980;
reflect how fast that person responded to the task. Guion, 1980; Messick, 1975; Tenopyr, 1977). As
Both test-retest and equivalent forms reliability Tenopyr and Oeltjen (1982) stated, it is difficult
are applicable to speeded tests, but split-half and to imagine a measurement situation that does not
internal consistency are not, unless the split is involve all aspects of validity. Although these will
based on time rather than number of items. be presented as separate categories, they really are
Although these will be presented as separate categories, they really are not; validity is best thought of as a unitary process with somewhat different but related facets (Cronbach, 1988). Messick (1989) defines validity as an integrated evaluative judgment of the adequacy and appropriateness of interpretations and actions based on the assessment measure.

Content Validity

Content validity refers to the question of whether the test adequately covers the dimension to be measured and is particularly relevant to achievement tests. The answer to this question lies less in statistical analyses and more in logical and rational analyses of the test content and in fact is not considered “true” validity by some (e.g., Guion, 1977; Messick, 1989). Messick (1989) considers content validity to have two aspects: content representativeness and content relevance. Thus items from a domain not only have to represent that domain but also have to be relevant to that domain.

When a test is constructed, content validity is often built in by a concentrated effort to make sure that the sample of behavior, that is the test, is truly representative of the domain being assessed. Such an effort requires first of all a thorough knowledge of the domain. If you are developing a test of depression, you must be very familiar with depression and know whether depression includes affect, sleep disturbances, loss of appetite, restricted interest in various activities, lowered self-esteem, and so on. Often teams of experts participate in the construction of a test, by generating and/or judging test items so that the end result is the product of many individuals. How many such experts should be used and how is their agreement quantified are issues for which no uniformly accepted guidelines exist. For some suggestions on quantifying content validity, see Lynn (1986); for a thorough analysis of content validity, see Hayes, Richard, and Kubany (1995).

Evaluating content validity is carried out by either subjective or empirical methods. Subjective methods typically involve asking experts to judge the relevance and representativeness of the test items with regard to the domain being assessed (e.g., Hambleton, 1984). Empirical methods involve factor analysis or other advanced statistical procedures designed to show that the obtained factors or dimensions correspond to the content domain (e.g., Davison, 1985).

Not only should the test adequately cover the contents of the domain being measured, but decisions must also be made about the relative representation of specific aspects. Consider a test in this class that will cover the first five chapters. Should there be an equal number of questions from each chapter, or should certain chapters be given greater preeminence? Certainly, some aspects are easier to test, particularly in a multiple-choice format. But would such an emphasis reflect “laziness” on the part of the instructor, rather than a well thought out plan designed to help build a valid test? As you see, the issue of content validity is one whose answer lies partly in expert skill and partly in individual preference. Messick (1989) suggests that content validity be discussed in terms of content relevance and content coverage rather than as a category of validity, but his suggestion has not been widely accepted as yet.

Taxonomies. Achieving content validity can be helped by having a careful plan of test construction, much like a blueprint is necessary to construct a house. Such plans take many forms, and one popular in education is based on a taxonomy of educational objectives (B. Bloom, 1956). Bloom and his colleagues have categorized and defined various educational objectives – for example, recognizing vocabulary, identifying concepts, and applying general principles to new situations. A test constructor would first develop a twofold table, listing such objectives on the left-hand side, and topics across the top – for example, for an arithmetic test such topics might be multiplication, division, etc. For each cell formed by the intersection of any two categories the test constructor decides how many test items will be written. If the total test items is to be 100, the test constructor might decide to have 5 multiplication items that assess rote memory, and two items that assess applying multiplicative strategies to new situations. Such decisions might be based on the relative importance of each cell, might reflect the judgment of experts, or might be a fairly subjective decision.
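A blueprint of this kind is just a two-way table of item counts, which can be sketched in a few lines of Python (ours; the objectives, topics, and counts below are invented for illustration):

    # Rows are educational objectives; columns are content topics.
    blueprint = {
        "rote memory":          {"multiplication": 5, "division": 5},
        "identifying concepts": {"multiplication": 4, "division": 4},
        "applying principles":  {"multiplication": 2, "division": 2},
    }

    planned_length = 22
    total = sum(n for topics in blueprint.values() for n in topics.values())
    assert total == planned_length, "blueprint must allocate every planned item"
    for objective, topics in blueprint.items():
        print(f"{objective:22} {topics}")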
Such taxonomies or blueprints are used widely in educational tests, sometimes quite explicitly, and sometimes rather informally. They are rarely used to construct tests in other domains, such as personality, although I would strongly argue that such planning would be quite useful and appropriate.

Criterion Validity

If a test is said to measure intelligence, we must show that scores on the test parallel or are highly correlated to intelligence as measured in some other way – that is, a criterion of intelligence. That of course is easier said than done. Think about intelligence. What would be an acceptable measure of intelligence? GPA? Extent of one’s vocabulary? Amount of yearly income? Reputation among one’s peers? Self-perception? Each of these could be argued for and certainly argued against. What if we were trying to develop a test of ego-strength? Where would we find a criterion measure of ego-strength in the real world? In essence, a test can never be better than the criterion it is matched against, and the world simply does not provide us with clear, unambiguous criteria. (If it did, it would probably be a very dull place!)

Criteria. The assessment of criterion validity is in fact quite common, and the literature is replete with studies that attempt to match test scores with independent criteria. There are all sorts of criteria just as there are all sorts of tests, but some types of criteria seem to be used more frequently. One such criterion is that of contrasted groups, groups that differ significantly on the particular domain. For example, in validating an academic achievement test we could administer the test to two groups of college students, matched on relevant variables such as age and gender, but differing on grade point average, such as honors students vs. those on academic probation. Another common class of criteria are those reflecting academic achievement, such as GPA, being on a Dean’s Honors List, and so on. Still other criteria involve psychiatric diagnosis, personnel ratings, and quite commonly, other previously developed tests.

Predictive and concurrent validity. In establishing criterion validity, we administer the test to a group of individuals and we compare their test scores to a criterion measure, to a standard, that reflects the particular variable we are interested in. Let’s assume we have a scholastic aptitude test (such as the SAT) that we wish to validate to then predict grade point average. Ideally, we would administer the test to an unselected sample, let them all enter college, wait for 5 years, measure what each student’s cumulative GPA is, and correlate the test scores with the GPA. This would be predictive validity. In real life we would have a difficult time finding an unselected sample, convincing school officials to admit all of them, and waiting 4 or 5 years. Typically, we would have a more homogeneous group of candidates, some of whom would not be accepted into college, and we might not wish or be able to wait any longer than a semester to collect GPA information.

Under other circumstances, it might make sense to collect both the test scores and the criterion data at the same time. For example, we might obtain the cooperation of a mechanics’ institute, where all the students can be administered a mechanical aptitude test and have instructors independently rate each student on their mechanical aptitude. This would be concurrent validity because both the test scores and the criterion scores are collected concurrently. The main purpose of such concurrent validation would be to develop a test as a substitute for a more time-consuming or expensive assessment procedure, such as the use of instructors’ ratings based on several months’ observation.

We would need to be very careful with both predictive and concurrent validity that the criterion, such as the instructors’ ratings, is independent of the test results. For example, we would not want the faculty to know the test results of students before grades are assigned because such knowledge might influence the grade; this is called criterion contamination and can affect the validity of results.
Construct Validity

Most if not all of the variables that are of interest to psychologists do not exist in the same sense that a pound of coffee exists. After all, you cannot buy a pound of intelligence, nor does the superego have an anatomical location like a kidney. These variables are “constructs,” theoretical fictions that encapsulate a number of specific behaviors, which are useful in our thinking about those behaviors. In studying these constructs, we typically translate them into specific operations, namely tests. Thus the theoretical construct of intelligence is translated or operationalized into a specific test of intelligence. When we validate a test, we are in effect validating the construct, and in fact quite often our professional interest is not so much on the test but on the construct itself. Tests are tools, and a psychologist or other professional is like a cabinetmaker, typically more interested in the eventual product that the tools can help create. He or she knows that poor tools will not result in a fine piece of furniture.

Construct validity is an umbrella term that encompasses any information about a particular test; both content and criterion validity can be subsumed under this broad term. What makes construct validity different is that the validity information obtained must occur within a theoretical framework. If we wish to validate a test of intelligence, we must be able to specify in a theoretical manner what intelligence is, and we must be able to hypothesize specific outcomes. For example, our theory of intelligence might include the notion that any gender differences reflect only cultural “artifacts” of child rearing; we would then experiment to see whether gender differences on our test do in fact occur, and whether they “disappear” when child rearing is somehow controlled. Note that construct validation becomes a rather complex and never-ending process, and one that requires asking whether the test is, in fact, an accurate reflection of the underlying construct. If it is not, then showing that the test is not valid does not necessarily invalidate the theory. Although construct validity subsumes criterion validity, it is not simply the sum of a bunch of criterion studies. Construct validity of a test must be assessed “holistically” in relation to the theoretical framework that gave birth to the test. Some argue that only construct validity will yield meaningful instruments (Loevinger, 1957; for a rather different point of view see Bechtoldt, 1959). In assessing construct validity, we then look for the correspondence between the theory and the observed data. Such correspondence is sometimes called pattern matching (Trochim, 1985; for an example see Marquart, 1989).

Messick (1995) argues that validity is not a property of the test but rather of the meaning of the test scores. Test scores are a function of at least three aspects: the test items, the person responding, and the context in which the testing takes place. The focus is on the meaning or interpretation of the score, and ultimately on construct validity that involves both score meaning and social consequences. (For an interesting commentary on construct validity see Zimiles, 1996.)

Thus, although we speak of validity as a property of a test, validity actually refers to the inference that is made from the test scores (Lawshe, 1985). When a person is administered a test, the result is a sample of that person’s behavior. From that sample we infer something – for example, we infer how well the person will perform on a future task (predictive or criterion validity), whether the person possesses certain knowledge (content validity), or a psychological construct or characteristic related to an outcome, such as spatial intelligence related to being an engineer (construct validity).

Both content validity and criterion validity can be conceptualized as special cases of construct validity. Given this, these different approaches should lead to consistent conclusions. Note however, that the two approaches of content and criterion validity ask different questions. Content validity involves the extent to which items represent the content domain. Thus we might agree that the item “how much is 5 + 3” represents basic arithmetical knowledge that a fifth grader ought to have. Criterion validity, on the other hand, essentially focuses on the difference between contrasted groups such as high and low performers. Thus, under content validity, an item need not show variation of response (i.e., variance) among the testees, but under criterion validity it must. It is then not surprising that the two approaches do not correlate significantly in some instances (e.g., Carrier, DaLessio, & Brown, 1990).
Methods for assessing construct validity. Cronbach and Meehl (1955) suggested five major methods for assessing construct validity, although many more are used. One such method is the study of group differences. Depending upon our particular theoretical framework we might hypothesize gender differences, differences between psychiatric patients and “normals,” between members of different political parties, between Christians and agnostics, and so on.

A second method involves the statistical notion of correlation and its derivative of factor analysis, a statistical procedure designed to elucidate the basic dimensions of a data set. (For an overview of the relationship between construct validity and factor analysis see B. Thompson & Daniel, 1996.) Again, depending on our theory, we might expect a particular test to show significant correlations with some measures and not with others (see below on convergent and discriminant validity).

A third method is the study of the internal consistency of the test. Here we typically try to determine whether all of the items in a test are indeed assessing the particular variable, or whether performance on a test might be affected by some other variable. For example, a test of arithmetic would involve reading the directions as well as the problems themselves, so we would want to be sure that performance on the test reflects arithmetic knowledge rather than reading skills.

A fourth method, as strange as it may sound, involves test-retest reliability, or more generally, studies of change over occasions. For example, is there change in test scores over time, say 2 days vs. 4 weeks? Or is there change in test scores if the examiner changes, say a white examiner vs. a black examiner? The focus here is on discovering systematic changes through experimentation, changes that again are related to the theoretical framework (note the high degree of similarity to our discussion of generalizability theory).

Finally, there are studies of process. Often when we give tests we are concerned about the outcome, about the score, and we forget that the process – how the person went about solving each item – is also quite important. This last method, then, focuses on looking at the process, observing how subjects perform on a test, rather than just what.

Convergent and discriminant validity. D. T. Campbell and Fiske (1959) and D. T. Campbell (1960) proposed that to show construct validity, one must show that a particular test correlates highly with variables, which on the basis of theory, it ought to correlate with; they called this convergent validity. They also argued that a test should not correlate significantly with variables that it ought not to correlate with, and called this discriminant validity. They then proposed an experimental design called the multitrait-multimethod matrix to assess both convergent and discriminant validity. Despite what may seem confusing terminology, the experimental design is quite simple, its intent being to measure the variation due to the trait of interest, compared with the variation due to the method of testing used.

Suppose we have a true-false inventory of depression that we wish to validate. We need first of all to find a second measure of depression that does not use a true-false or similar format – perhaps a physiological measure or a 10-point psychiatric diagnostic scale. Next, we need to find a different dimension than depression, which our theory suggests should not correlate but might be confused with depression, for example, anxiety. We now locate two measures of anxiety that use the same format as our two measures of depression. We administer all four tests to a group of subjects and correlate every measure with every other measure. To show convergent validity, we would expect our two measures of depression to correlate highly with each other (same trait but different methods). To show discriminant validity we would expect our true-false measure of depression not to correlate significantly with the true-false measure of anxiety (different traits but same method). Thus the relationship within a trait, regardless of method, should be higher than the relationship across traits. If it is not, it may well be that test scores reflect the method more than anything else. (For a more recent discussion of the multitrait-multimethod approach, see Ferketich, Figueredo, & Knapp, 1991; and Lowe & Ryan-Wenger, 1992; for examples of multitrait-multimethod research studies, see Morey & LeVine, 1988; Saylor et al., 1984.) Other more sophisticated procedures have now been proposed, such as the use of confirmatory factor analysis (D. A. Cole, 1987).
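The logic of the matrix can be illustrated with simulated data. In the Python sketch below (ours; the variable names and the noise model are assumptions, and method variance is left out for simplicity), two traits are each measured by two methods, every measure is correlated with every other, and the convergent correlation comes out high while the discriminant correlation hovers near zero:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    depression = rng.normal(size=n)  # latent trait 1
    anxiety = rng.normal(size=n)     # latent trait 2, uncorrelated here

    # Each "measure" = latent trait plus random error.
    measures = np.array([
        depression + rng.normal(scale=0.5, size=n),  # depression, true-false
        depression + rng.normal(scale=0.5, size=n),  # depression, physiological
        anxiety + rng.normal(scale=0.5, size=n),     # anxiety, true-false
        anxiety + rng.normal(scale=0.5, size=n),     # anxiety, physiological
    ])

    r = np.corrcoef(measures)
    print(f"convergent   (dep/TF vs. dep/physio): {r[0, 1]:.2f}")  # high
    print(f"discriminant (dep/TF vs. anx/TF):     {r[0, 2]:.2f}")  # near zero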
Other Aspects

Face validity. Sometimes we speak of face validity, which is not validity in the technical sense, but refers to whether a test “looks like” it is measuring the pertinent variable. We expect, for example, a test of intelligence to have us define words and solve problems, rather than to ask us questions about our musical and food preferences. A test may have a great deal of face validity yet may not in fact be valid. Conversely, a test may lack face validity but in reality be a valid measure of a particular variable. Clearly, face validity is related to client rapport and cooperation, because ordinarily, a test that looks valid will be considered by the client more appropriate and therefore taken more seriously than one that does not. There are occasions, however, where face validity may not be desirable, for example, in a test to detect “honesty” (see Nevo, 1985, for a review).

Differential validity. Lesser (1959) argued that we should not consider a test as valid or invalid in a general sense, that studies sometimes obtain different results with the same test not necessarily because the test is invalid, but because there is differential validity in different populations, and that such differential validity is in fact a predictable phenomenon.

Meta-analysis. Meta-analysis consists of a number of statistical analyses designed to empirically assess the findings from various studies on the same topic. In the past, this was done by a narrative literature review where the reviewer attempted to logically assess the state of a particular question or area of research. For an example of a meta-analysis on the Beck Depression Inventory, see Yin and Fan (2000).

Validity generalization. Another approach is that of validity generalization, where correlation coefficients across studies are combined and statistically corrected for such aspects as unreliability, sampling error, and restriction in range (Schmidt & Hunter, 1977).

ASPECTS OF VALIDITY

Bandwidth fidelity. Cronbach and Gleser (1965) used the term bandwidth to refer to the range of applicability of a test – tests that cover a wide area of functioning such as the MMPI are broad-band tests; tests that cover a narrower area, such as a measure of depression, are narrow-band tests. These authors also used the term fidelity to refer to the thoroughness of the test. These two aspects interact with each other, so that given a specific amount (such as test items), as bandwidth increases, fidelity decreases. Thus, with the 500+ items of the MMPI, we can assess a broad array of psychopathology, but none in any depth. If we had 500+ items all focused on depression, we would have a more precise instrument, i.e., greater fidelity, but we would only be covering one area.

Group homogeneity. If we look at various measures designed to predict academic achievement, such as achievement tests used in the primary grades, those used in high school, the SAT used for college admissions, and the GRE (Graduate Record Examination) used for graduate school admissions, we find that the validity coefficients are generally greater at the younger ages; there is a greater correlation between test scores and high-school grades than there is between test scores and graduate-school grades. Why? Again, lots of reasons of course, but many of these reasons are related to the notion that variability is lessened. For example, grades in graduate school show much less variability than those in high school because often only As and Bs are awarded in graduate seminars. Similarly, those who apply and are admitted to graduate school are more homogeneous (similar in intelligence, motivation to complete their degrees, intellectual interests, etc.) as a group than high-school students. All other things being equal, homogeneity results in a lowered correlation between test scores and criterion.

One practical implication of this is that when we validate a test, we should validate it on unselected samples, but in fact they may be difficult or impossible to obtain. This means that a test that shows a significant correlation with college grades in a sample of college students may work even better in a sample of high-school students applying to college.
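The effect of group homogeneity is easy to demonstrate by simulation. This Python sketch (ours; the coefficients are arbitrary) draws correlated aptitude and grade scores, then recomputes the correlation within only the top quarter of aptitude scores, mimicking a selected, homogeneous sample:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    aptitude = rng.normal(size=n)
    grades = 0.6 * aptitude + 0.8 * rng.normal(size=n)  # population r near .60

    r_full = np.corrcoef(aptitude, grades)[0, 1]
    top = aptitude > np.quantile(aptitude, 0.75)  # "admit" the top 25% only
    r_restricted = np.corrcoef(aptitude[top], grades[top])[0, 1]
    print(f"unselected sample: r = {r_full:.2f}")        # about .60
    print(f"selected sample:   r = {r_restricted:.2f}")  # noticeably lower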
Cross-validation. In validating a test, we collect information on how the test works in a particular sample or situation. If we have data on several samples that are similar, we would typically call this “validity generalization.” However, if we make some decision based on our findings – for example, we will accept into our university any students whose combined SAT scores are above 1200 – and we test this decision out on a second sample, that is called cross-validation. Thus cross-validation is not simply collecting data on a second sample, but involves taking a second look at a particular decision rule.

Are reliability and validity related? We have discussed reliability and validity separately because logically they are. They are however also related. In the multitrait-multimethod approach, for example, our two measures of depression differ in their method, and so this is considered to be validity. What if the two forms did not differ in method? They would of course be parallel forms and their relationship would be considered reliability. We have also seen that both internal consistency and test-retest reliability can be seen from both a reliability framework or from a validity framework.

Another way that reliability and validity are related is that a test cannot be valid if it is not reliable. In fact, the maximum validity coefficient between two variables is equal to:

√(r11 × r22)

where r11 again represents the reliability coefficient of the first variable (for example, a test) and r22 the reliability coefficient of the second variable (for example, a criterion). If a test we are trying to validate has, for example, a reliability of .70 and the criterion has a reliability of .50, then the maximum validity coefficient we can obtain is .59. (Note, of course, that this is the same formula we used for the correction for attenuation.)

Interpreting a validity coefficient. Much of the evidence for the validity of a test will take the form of correlation coefficients, although of course other statistical procedures are used. When we discussed reliability, we said that it is generally agreed that for a test to be considered reliable, its reliability correlation coefficient should be at least .70. In validity, there is no such accepted standard. In general, validity coefficients are significantly lower because we do not expect substantial correlations between tests and complex real-life criteria. For example, academic grades are in part a function of intelligence or academic achievement, but they can also reflect motivation, interest in a topic, physical health, whether a person is in love or out of love, etc.

Whether a particular validity correlation coefficient is statistically significant, of sufficient magnitude to indicate that most likely there is a relationship between the two variables, depends in part upon the size of the sample on which it is based. But statistical significance may not be equivalent to practical significance. A test may correlate significantly with a criterion, but the significance may reflect a very large sample, rather than practical validity. On the other hand, a test of low validity may be useful if the alternative ways of reaching a decision are less valid or not available.

One useful way to interpret a validity coefficient is to square its value and take the resulting number as an index of the overlap between the test and the criterion. Let’s assume for example, that there is a correlation of about .40 between SAT (a test designed to measure “scholastic aptitude”) scores and college GPA. Why do different people obtain different scores on the SAT? Lots of reasons, of course – differences in motivation, interest, test sophistication, lack of sleep, anxiety, and so on – but presumably the major source of variation is “scholastic aptitude.” Why do different people obtain different grades? Again, lots of different reasons, but if there is an r of .40 between SAT and GPA, then .40 squared equals .16; that is, 16% of the variation in grades will be due to (or explained by) differences in scholastic aptitude. In this case, that leaves 84% of the variation in grades to be “explained” by other variables. Even though an r of .40 looks rather large, and is indeed quite acceptable as a validity coefficient, its explanatory power (16%) is rather low – but this is a reflection of the complexity of the world, rather than a limitation of our tests.

Prediction. A second way to interpret a validity correlation coefficient is to recall that where there is a correlation the implication is that scores on the criterion can be predicted, to some degree, by scores on the test. The purpose of administering a test such as the SAT is to make an informed judgment about whether a high-school senior can do college work, and to predict what that person’s GPA will be. Such a prediction can be made by realizing that a correlation coefficient is simply an index of the relationship between two variables, a relationship that can be expressed by the equation Y = bX + a, where Y might be the GPA we wish to predict, X is the person’s SAT score, and b and a reflect other aspects of our data (we discussed the use of such equations in Chapter 2).
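Two of these computations in Python (our sketch; the function name is our own):

    from math import sqrt

    def maximum_validity(r11, r22):
        # The ceiling that unreliability places on a validity coefficient.
        return sqrt(r11 * r22)

    print(round(maximum_validity(.70, .50), 2))  # 0.59

    r = .40  # SAT vs. GPA validity coefficient
    print(f"{r ** 2:.0%} of the variance in grades is accounted for")  # 16%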
Expectancy table. Still a third way to interpret a validity correlation is through the use of an expectancy table (see Chapter 2). Suppose we have administered the SAT to a group of 100 students entering a university, and after 4 years of college work we compute their cumulative GPA. We table the data as shown in Figure 3.3.

FIGURE 3–3. Example of an expectancy table.

                                Cumulative GPA
  Combined SAT Score   3.5 and above   2.5 to 3.49   2.49 and below   Total
  1400 and above             18              4              3          (25)
  1000 to 1399                6             28             11          (45)
  999 and below               2             16             12          (30)

What this table shows is that 18 of the 25 students (or 72%) who obtained combined SAT scores of 1,400 and above obtained a cumulative GPA of 3.5 or above, whereas only 6 of the 45 students (13%) who scored between 1,000 to 1,399 did such superior work, and only 2 of the 30 (7%) who scored 999 and below. If a new student with SAT scores of 1,600 applied for admission, our expectancy table would suggest that indeed the new student should be admitted.

This example is of course fictitious but illustrative. Ordinarily our expectancy table would have more categories, both for the test and the criterion. Note that although the correlation is based on the entire sample, our decision about a new individual would be based on just those cases that fall in a particular cell. If the number of cases in a cell is rather small (for example, the two individuals who scored below 999 but had a GPA of 3.5 and above), then we need to be careful about how confident we can be in our decision. Expectancy tables can be more complex and include more than two variables – for example, if gender or type of high school attended were related to SAT scores and GPA, we could include these variables into our table, or create separate tables.

Standard error of estimate. Still another way to interpret a validity coefficient is by recourse to the standard error. In talking about reliability, we talked about “noise in the system,” that is lack of perfect reliability. Similarly with validity we ordinarily have a test that has less than perfect validity, and so when we use that test score to predict a criterion score, our predicted score will have a margin of error. That margin of error can be defined as the SE of estimate which equals:

SE of estimate = SD √(1 − r12²)

where SD is the standard deviation of the criterion scores and r12 is the validity coefficient. Note that if the test had perfect validity, that is r12 = 1.00, then the SE of estimate is zero; there would be no error, and what we predicted as a criterion score would indeed be correct. At the other extreme, if the test were not valid, that is r12 = zero, then the SE of estimate would equal the SD, that is, what we predicted as a criterion score could vary by plus or minus a SD 68% of the time. This would be akin to simply guessing what somebody’s criterion score might be.
Decision theory. From the above discussion of validity, it becomes evident that often the usefulness of a test can be measured by how well the test predicts the criterion. Does the SAT predict academic achievement? Can a test of depression predict potential suicide attempts? Can a measure of leadership identify executives who will exercise that leadership? Note that in validating a test we both administer the test and collect information on the criterion. Once we have shown that the test is valid for a particular purpose, we can then use the test to predict the criterion. Because no test has perfect validity, our predictions will have errors.

Consider the following example. Students entering a particular college are given a medical test (an injection) to determine whether or not they have tuberculosis. If they have TB, the test results will be positive (a red welt will form); if they don’t, the test results will be negative (no welt). The test, however, does not have perfect validity, and the test results do not fully correspond to the real world. Just as there are two possible outcomes with the test (positive or negative), there are two possibilities in the real world: either the person has or does not have TB. Looking at the test and at the world simultaneously yields four categories, as shown in Figure 3.4.

FIGURE 3–4. Decision categories.

                           Real world: positive           Real world: negative
  Test for TB: positive    A – hits                       C – errors (false positives)
  Test for TB: negative    D – errors (false negatives)   B – hits

Category A consists of individuals who on the test are positive for TB and indeed do have TB. These individuals, from a psychometric point of view, are considered “hits” – the decision based on the test matches the real world. Similarly, category B consists of individuals for whom the test results indicate that the person does not have (is negative for) TB, and indeed they do not have TB – another category that represents “hits.”

There are, however, two types of errors. Category C consists of individuals for whom the test results suggest that they are positive for TB, but they do not have TB; these are called false positives. Category D consists of individuals for whom the test results are negative. They do not appear to have TB but in fact they do; thus they are false negatives.

We have used a medical example because the terminology comes from medicine, and it is important to recognize that medically to be “positive” on a test is not a good state of affairs. Let’s turn now to a more psychological example and use the SAT to predict whether a student will pass or fail in college. Let’s assume that for several years we have collected information at our particular college on SAT scores and subsequent passing or failing. Assuming that we find a correlation between these two variables, we can set up a decision table like the one in Figure 3.5.

FIGURE 3–5. Example of a decision table.

                                      Student fails                  Student passes
  SAT positive (student will fail)    A – hits                       C – errors (false positives)
  SAT negative (student will pass)    D – errors (false negatives)   B – hits

Again we have four categories. Students in cell A are those for whom we predict failure based on their low SAT scores, and if they were admitted, they would fail. Category B consists of students for whom we predict success, are admitted, and do well academically. Both categories A and B are hits. Again, we have two types of errors: the false positives of category C for whom we predicted failure, but would have passed had they been admitted, and the false negatives of category D for whom we predicted success, but indeed once admitted, they failed.
Sensitivity, specificity, and predictive value. The relative frequencies of the four categories lead to three terms that are sometimes used in the literature in connection with tests (Galen & Gambino, 1975). The sensitivity of a test is the proportion of correctly identified positives (i.e., how accurately does a test classify a person who has a particular disorder?), that is, true positives, and is defined as:

Sensitivity = true positives / (true positives + false negatives) × 100

In the diagram of Figure 3.4, this ratio equals A / (A + D).

The specificity of a test is the proportion of correctly identified negatives (i.e., how accurately does a test classify those who do NOT have the particular condition?), that is, true negatives, and is defined as:

Specificity = true negatives / (true negatives + false positives) × 100

or B / (C + B).

The predictive value (also called efficiency) of a test is the ratio of true positives to all positives, and is defined as:

Predictive value = true positives / (true positives + false positives) × 100

or A / (A + C).

An ideal test would have a high degree of sensitivity and specificity, as well as high predictive value, with a low number of false positive and false negative decisions. (See Klee & Garfinkel [1983] for an example of a study that uses the concepts of sensitivity and specificity; see also Baldessarini, Finkelstein, & Arana, 1983; Gerardi, Keane, & Penk, 1989.)

An example from suicide. Maris (1992) gives an interesting example of the application of decision theory to some data of a study by Pokorny of 4,704 psychiatric patients who were tested and followed up for 5 years. In this group of patients, 63 committed suicide. Using a number of tests to make predictions about subsequent suicide, Pokorny obtained the results shown in Figure 3.6.

FIGURE 3–6. Example of a decision table as applied to suicide.

                                          Committed suicide        Did not commit suicide
  Prediction: will commit suicide         35 (true positives)      1,206 (false positives)    = 1,241 predicted as suicide
  Prediction: will not commit suicide     28 (false negatives)     3,435 (true negatives)     = 3,463 predicted as nonsuicide
                                          = 63 committed suicide   = 4,641 did not commit suicide

The sensitivity of Pokorny’s procedure is thus:

Sensitivity = 35 / (35 + 28) = 35/63 = 55%

The specificity of Pokorny’s procedure is:

Specificity = 3435 / (3435 + 1206) = 3435/4641 = 74%

and the predictive value is:

Predictive value = 35 / (35 + 1206) = 35/1241 = 2.8%

Note that although the sensitivity and specificity are respectable, the predictive value is extremely low.
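These three indices are one-liners in Python; the sketch below (ours) checks them against Pokorny’s figures:

    def sensitivity(tp, fn):
        return tp / (tp + fn) * 100

    def specificity(tn, fp):
        return tn / (tn + fp) * 100

    def predictive_value(tp, fp):
        return tp / (tp + fp) * 100

    # Pokorny: 35 true positives, 1,206 false positives,
    # 28 false negatives, 3,435 true negatives.
    print(round(sensitivity(35, 28), 1))         # 55.6 (the text rounds to 55%)
    print(round(specificity(3435, 1206), 1))     # 74.0
    print(round(predictive_value(35, 1206), 1))  # 2.8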
Reducing errors. In probably every situation where a series of decisions is made, such as which 2,000 students to admit to a particular university, there will be errors made regardless of whether those decisions are made on the basis of test scores, interview information, flipping of a coin, or other method. Can these errors be reduced? Yes, they can. First of all, the more valid the measure or procedure on which decisions are based, the fewer the errors. Second, the more comprehensive the database available on which to make decisions, the fewer the errors; for example, if we made decisions based only on one source of information, the SAT for example – vs. using multiple data sources, the SAT plus high-school grades, plus autobiographical statement, plus letters of recommendation, etc. – we would make greater errors where we used only one source of information. Of course, adding poor measures to our one source of information might in fact increase our errors. We can also use sequential strategies. In the example of TB screening, the initial test is relatively easy and inexpensive to administer, but produces a fair number of errors. We could follow up those individuals who show signs of being positive on the test by more sophisticated and expensive tests to identify more of the false positives.

We can also change the decision rule. For example, instead of deciding that any student whose combined SAT score is below 800 is at risk to fail, we could lower our standards and use a combined score of 400. Figure 3.7 shows what would happen.

FIGURE 3–7. Decision table for college admissions.

                                                    Student fails                  Student passes
  Positive (student will fail if SAT is below 400)  A – hits                       C – errors (false positives)
  Negative (student will pass if SAT is above 400)  D – errors (false negatives)   B – hits

Our rate of false positives, students for whom we are predicting failure but indeed would pass, is lowered. However, the number of false negatives, students for whom we predict success but in fact will fail, is now substantially increased. If we increase our standards, for example, we require a combined SAT score of 1,400 for admission, then we will have the opposite result: The number of false positives will increase and the number of false negatives will decrease. The standard we use, the score that we define as acceptable or not acceptable, is called the cutoff score (see Meehl & Rosen, 1955, for a discussion of the problems in setting cutoff scores).

Which type of error? Which type of error are we willing to tolerate more? That of course depends upon the situation and upon philosophical, ethical, political, economic, and other issues. Some people, for example, might argue that for a state university it is better to be liberal in admission standards and allow almost everyone in, even if a substantial number of students will never graduate. In some situations, for example selecting individuals to be trained as astronauts, it might be better to be extremely strict in the selection standards and choose individuals who will be successful at the task, even if it means keeping out many volunteers who might have been just as successful.
Selection ratio. One of the issues that impinges on our decision and the kind of errors we tolerate is the selection ratio, which refers to the number of individuals we need to select from the pool of applicants. If there are only 100 students applying to my college and we need at least 100 paying students, then I will admit everyone who applies and won’t care what their SAT scores are. On the other hand, if I am selecting scholarship recipients and I have two scholarships to award and 100 candidates, I can be extremely demanding in my decision, which will probably result in a high number of false positives.

The base rate. Another aspect we must take into consideration is the base rate, that is the naturally occurring frequency of a particular behavior. Assume, for example, that I am a psychologist working at the local jail and over the years have observed that about one out of 100 prisoners attempts suicide (the actual suicide rate seems to be about 1 in 2,500 inmates, a rather low base rate from a statistical point of view; see Salive, Smith, & Brewer, 1989). As prisoners come into the jail, I am interested in identifying those who will attempt suicide to provide them with the necessary psychological assistance and/or take the necessary preventive action such as removal of belts and bed sheets and 24-hour surveillance. If I were to institute an entrance interview or testing of new inmates, what would happen? Let’s say that I would identify 10 inmates out of 100 as probable suicide attempters; those 10 might not include the one individual who really will commit suicide. Notice then that I would be correct 89 out of 100 times (the 89 for whom I would predict no suicide attempt and who would behave accordingly). I would be incorrect 11 out of 100 times, for the 10 false positive individuals whom I would identify as potential suicides, and the 1 false negative whom I would not detect as a potential suicide. But if I were to do nothing and simply declare that “no one commits suicide in jail,” I would be correct 99% of the time. When base rates are extreme, either high or low, our accuracy rate goes down. In fact, as Meehl and Rosen (1955) argued years ago, when the base rate of the criterion deviates significantly from a 50% split, the use of a test or procedure that has slight or moderate validity could result in increased errors. Base rates are often neglected by practitioners; for a more recent plea to consider base rates in the clinical application of tests see Elwood (1993).

Obviously, what might be correct from a statistical point of view (do nothing) might not be consonant with ethical, professional, or humanitarian principles. In addition, of course, an important consideration would be whether one individual who will attempt suicide can be identified, even if it means having a high false positive rate. Still another concern would be the availability of the information needed to assess base rates – quite often such information is lacking. Perhaps it might be appropriate to state the obvious: The use of psychological tests is not simply a function of their psychometric properties, but involves a variety of other concerns; in fact, the very issue of whether validity means utility, whether a particular test should be used simply because it is valid, is a source of controversy (see Gottfredson & Crouse, 1986).
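The arithmetic of the jail example can be checked in a few lines (our sketch; it assumes, as in the text, that the 10 flagged inmates do not include the one true attempter):

    n = 100         # incoming inmates
    attempters = 1  # base rate of 1 in 100
    flagged = 10    # inmates identified as probable attempters

    # Screening errs on the 10 false positives and the 1 missed case.
    print((n - flagged - attempters) / n)  # 0.89: 89% correct

    # Predicting "no one commits suicide" errs only on the 1 attempter.
    print((n - attempters) / n)            # 0.99: 99% correct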
P1: JZP
0521861810c03 CB1038/Domino 0 521 86181 0 March 4, 2006 14:16

64 Part One. Basic Issues

Sample size. Another aspect that influences validity is the size of the sample that is studied when a test is validated, an issue that we have already mentioned (Dahlstrom, 1993). Suppose I administer a new test of intelligence to a sample of college students and correlate their scores on the test with their GPA. You will recall that whether or not a correlation coefficient is statistically significant, that is, different from zero, is a function of the sample size. For example, here are the correlation coefficients needed for samples of various sizes, using the .05 level of significance:

    Sample size     Correlation coefficient
    10              .63
    15              .51
    20              .44
    80              .22
    150             .16

Note that with a small sample of N = 10, we would need to get a correlation of at least .63 to conclude that the two variables are significantly correlated, but with a large sample of N = 150, the correlation would need to be only .16 or larger to reach the same conclusion. Schmidt and Hunter (1980) have in fact argued that the available evidence underestimates the validity of tests because samples, particularly those of working adults in specific occupations, are quite small.
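As an illustration (ours, not part of the original text), the critical values in the table above can be computed directly: a sample correlation differs significantly from zero at the .05 level (two-tailed) when r is at least t / sqrt(t^2 + df), where df = N - 2 and t is the critical t value. A minimal Python sketch, assuming scipy is available:

    from math import sqrt
    from scipy.stats import t

    def critical_r(n, alpha=0.05):
        # smallest correlation significantly different from zero
        # for a sample of size n, two-tailed test
        df = n - 2
        t_crit = t.ppf(1 - alpha / 2, df)
        return t_crit / sqrt(t_crit ** 2 + df)

    for n in (10, 15, 20, 80, 150):
        print(n, round(critical_r(n), 2))   # .63, .51, .44, .22, .16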
Validity as a changing concept. What we have discussed above about validity might be termed the "classical" view. But our understanding of validity is not etched in stone and is evolving just as psychology evolves. In a historical overview of the concept of validity, Geisinger (1992) suggests that the concept has undergone, and is still undergoing, a metamorphosis and has changed in several ways. Currently, validity is focused on validating a test for a specific application with a specific sample and in a specific setting; it is largely based on theory, and construct validity seems to be rapidly gaining ground as the method.

In a recent revision of the Standards for Educational and Psychological Testing (1999), the committee who authored these standards argue persuasively that validity needs to be considered in the broad context of generalizability. That is, simply because one research study shows a correlation coefficient of +.40 between SAT scores and 1st-year college GPA at a particular institution doesn't necessarily mean that the same result will be obtained at another institution. On the one hand, we expect a certain amount of stability of results across studies, but on the other, when we don't obtain such stability, we need to be aware of this and identify the various sources for obtaining different results. Changes occur from one setting to another and even within a particular setting. Perhaps a study conducted in the 1970s consisted primarily of white male middle-class students, whereas now any representative sample would be much more heterogeneous. Perhaps at one university we may have grade inflation while, at another, the grading standards may be more rigorous.

Taylor and Russell Tables. The selection ratio, the base rate, and the validity of a test are all related to the predictive efficiency of that test. In fact, H. C. Taylor and Russell (1939) computed tables that allow one to predict how useful a particular test can be in a situation where we know the selection ratio, the base rate, and the validity coefficient.
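Taylor and Russell derived their tables analytically from the bivariate normal model; the same quantities can be approximated by simulation. The sketch below is our illustration, not part of the original discussion (numpy is assumed); it estimates the proportion of selected persons who turn out to be successful, given the three ingredients just named:

    import numpy as np

    def taylor_russell(validity, base_rate, selection_ratio,
                       n=1_000_000, seed=0):
        # Simulate applicants under the bivariate normal model.
        rng = np.random.default_rng(seed)
        test = rng.standard_normal(n)
        noise = rng.standard_normal(n)
        criterion = validity * test + np.sqrt(1 - validity ** 2) * noise
        # "successful" = top base_rate fraction on the criterion;
        # "selected"   = top selection_ratio fraction on the test
        success = criterion >= np.quantile(criterion, 1 - base_rate)
        selected = test >= np.quantile(test, 1 - selection_ratio)
        return success[selected].mean()

    # With a validity of .40, a base rate of .50, and a selection
    # ratio of .30, roughly 69% of those selected should succeed.
    print(round(taylor_russell(0.40, 0.50, 0.30), 2))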
Validity from an Individual Point of View

Most, if not all, of our discussion of validity stems from what can be called a nomothetic point of view, a scientific approach based on general laws and relations. Thus with the SAT we are interested in whether SAT scores are related to college grades, whether SAT scores predict college achievement in minority groups to the same extent as in the majority, whether there may be gender differences, and whether scores can be maximized through calculated guessing. Note that these and other questions focus on the SAT as a test, the answers involve psychometric considerations, and we really don't care who the specific subjects are, beyond the requirements that they be representative, and so on.

The typical practitioner, however, whether clinical psychologist, school counselor, or psychiatric nurse, is usually interested not so much in the test as in the client who has taken the test. As Gough (1965) indicated, the practitioner uses tests to obtain a psychological description of the client, to predict what the client will say or do, and to understand how others react to this client.

Gough (1965) then developed a conceptual model of validity, not aimed at just a
psychometric understanding of the test, but at a clinical understanding of the client. Gough (1965) proposed that if a practitioner wishes to use a particular test to understand a client, there are three questions or types of validity he or she must be concerned with. (For a slightly different tripartite conceptualization of validity, especially as applied to sociological measurement, see Bailey, 1988.)

Primary validity. The first question concerns the primary validity of the test; primary validity is basically similar to criterion validity. If someone publishes a new academic achievement test, we would want to see how well the test correlates with GPA, whether the test can in fact separate honors students from nonhonors students, and so on. This is called primary because if a test does not have this kind of basic validity, we must look elsewhere for a useful measure.

Secondary validity. If the evidence indicates that a test has primary validity, then we move on to secondary validity, which addresses the psychological basis of measurement of the scale. If the new "academic achievement" test does correlate well with GPA, then we can say, "fine, but what does the test measure?" Just because the author named it an "academic achievement" test does not necessarily mean it is so. To obtain information on secondary validity, on the underlying psychological dimension that is being measured, Gough (1965) suggested four steps: (1) reviewing the theory behind the test and the procedures and samples used to develop the test; (2) analyzing from a logical-clinical point of view the item content (Is a measure of depression made up primarily of items that reflect low self-esteem?); (3) relating scores on the measure being considered to variables that are considered to be important, such as gender, intelligence, and socioeconomic status; (4) obtaining information about what high scorers and low scorers on the scale are like psychologically.

Tertiary validity. Tertiary validity is concerned with the justification for developing and/or using a particular test. Suppose for example, the new "academic achievement" test we are considering predicts GPA about as well as the SAT. Suppose also a secondary validity analysis suggests that the new test, like the SAT, is basically a measure of scholastic aptitude, and uses the kind of items that are relevant to school work. Because the SAT is so well established, why bother with a new measure? Suppose however, that an analysis of the evidence suggests that the new measure also identifies students who are highly creative, and the measure takes only 10 minutes to administer. I may not necessarily be interested in whether my client, say a business executive unhappy with her position, has high academic achievement potential, but I may be very interested in identifying her level of creativity. (For specific examples of how the three levels of validity are conceptualized with individual tests see Arizmendi, Paulsen, & G. Domino, 1981; G. Domino & Blumberg, 1987; and Gough, 1965.)

A Final Word about Validity

When we ask questions about the validity of a test we must ask "validity for what?" and "under what circumstances?" A specific test is not valid in a general sense. A test of depression for example, may be quite valid for psychiatric patients but not for college students. On the other hand, we can ask the question of "in general, how valid are psychological tests?" Meyer and his colleagues (2001) analyzed data from the available literature and concluded that not only is test validity "strong and compelling" but that the validity of psychological tests is comparable to the validity of medical procedures.

SUMMARY

Reliability can be considered from a variety of points of view, including stability over time and equivalence of items, sources of variation, and "noise in the system." Four ways to assess reliability have been discussed: test-retest reliability, alternate forms reliability, split-half reliability, and interitem consistency. For some tests, we need also to be concerned about scorer or rater reliability. Although reliability is most often measured by a correlation coefficient, the standard error of measurement can also be useful. A related measure, the standard error of differences, is useful when we consider whether the difference between two scores obtained by an individual is indeed meaningful.
Validity, whether a test measures what it is said to measure, was discussed in terms of content validity, criterion validity, and construct validity. Content validity is a logical type of validity, particularly relevant to educational tests, and is the result of careful planning, of having a blueprint to how the test will be constructed. Criterion validity concerns the relationship of a test to specified criteria and is composed of predictive and concurrent validity. Construct validity is an umbrella term that can subsume all other types of validity and is principally related to theory construction. A method to show construct validity is the multitrait-multimethod matrix, which gets at convergent and discriminant validity. There are various ways to interpret validity coefficients, including squaring the coefficient, using a predictive equation, an expectancy table, and the standard error of estimate. Because errors of prediction will most likely occur, we considered validity from the point of view of false positives and false negatives. In considering validity we also need to be mindful of the selection ratio and the base rate. Finally, we considered validity from an "individual" point of view.

SUGGESTED READINGS

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
A basic and classic paper that focused on construct validity.

Dahlstrom, W. G. (1993). Tests: Small samples, large consequences. American Psychologist, 48, 393-399.
The author argues that tests, if soundly constructed and responsibly applied, can offset the errors of judgment often found in daily decision making. A highly readable article.

Domino, G., & Blumberg, E. (1987). An application of Gough's conceptual model to a measure of adolescent self-esteem. Journal of Youth and Adolescence, 16, 179-190.
An illustration of Gough's conceptual model as applied to a paper-and-pencil measure of self-esteem.

Hadorn, D. C., & Hays, R. D. (1991). Multitrait-multimethod analysis of health-related quality-of-life measures. Medical Care, 29, 829-840.
An example of the multitrait-multimethod approach as applied to the measurement of quality of life.

Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50, 741-749.
Messick argues that the three major categories of validity - content, criterion, and construct validity - present an incomplete and fragmented view. He argues that these are but a part of a comprehensive theory of construct validity that looks not only at the meaning of scores, but the social values inherent in test interpretation and use.

DISCUSSION QUESTIONS

1. I have developed a test designed to assess creativity in adults. The test consists of 50 true-false questions such as, "Do you consider yourself creative?" and "As a child were you extremely curious?" How might the reliability and validity of such a test be determined?
2. For the above test, assume that it is based on psychoanalytic theory that sees creativity as the result of displaced sexual and aggressive drives. How might the construct validity of such a test be determined?
3. Why is reliability so important?
4. Locate a meta-analytical study of a psychological test. What are the conclusions arrived at by the author(s)? Is the evidence compelling?
5. In your own words, define the concepts of sensitivity and specificity.
PART TWO: DIMENSIONS OF TESTING

4 Personality
AIM This chapter focuses on the assessment of “normal” personality. The question
of how many basic personality dimensions exist, and other basic issues are discussed.
Nine instruments illustrative of personality assessment are considered; some are well
known and commercially available, while others are not. Finally, the Big-Five model,
currently a popular one in the field of personality assessment, is discussed.
INTRODUCTION

Personality

Personality occupies a central role both in the field of psychology and in psychological testing. Although the first tests developed were not of personality but of aptitude (by the Chinese) and of intelligence (by the French psychologist, Binet), the assessment of personality has been a major endeavor.

If this were a textbook on personality, we would probably begin with a definition of personality and, at the very least, an entire chapter would illustrate the diversity of definitions and the variety of viewpoints and arguments embedded in such definitions. Since this is not such a textbook, we defer such endeavors to the experts (e.g., Allport, 1937; 1961; Guilford, 1959b; Hall & Lindzey, 1970; McClelland, 1951; Mischel, 1981; Wiggins, 1973).

In general, when we talk about personality we are talking about a variety of characteristics whose unique organization defines an individual and, to a certain degree, determines that person's interactions with himself/herself, with others, and with the environment. A number of authors consider attitudes, values, and interests under the rubric of personality; these are discussed in Chapter 6. Still others, quite correctly, include the assessment of psychopathology such as depression, and psychopathological states such as schizophrenia; we discuss these in Chapter 7 and in Chapter 15. Finally, most textbooks also include the assessment of positive functioning, such as creativity, under the rubric of personality. Because we believe that the measurement of positive functioning has in many ways been neglected, we discuss the topic in Chapter 8.

Internal or External?

When you do something, why do you do it? Are the determinants of your behavior due to inner causes, such as needs, or are they due to external causes such as the situation you are in? Scientists who focus on the internal aspects emphasize such concepts as personality traits. Those who focus on the external aspects emphasize more situational variables. For many years, the trait approach was the dominant one, until about 1968 when Mischel published a textbook titled Personality and Assessment, and strongly argued that situations had been neglected, and that to fully understand personality one needed to pay attention to the reciprocal interactions between person and situation. This message was not new; many other psychologists, such as Henry Murray, had made the same argument much earlier. The
message is also quite logical; if nothing else, we know that behavior is multiply determined, that typically a particular action is the result of many aspects.

Endler and Magnusson (1976) suggested that there are five major theoretical models that address the above question:

1. The trait model. This model assumes that there is a basic personality core, and that traits are the main source of individual differences. Traits are seen as quite stable.
2. The psychodynamic model. This model also assumes the presence of a basic personality core and traits as components. But much of the focus is on developmental aspects and, in particular, how early experiences affect later development.
3. The situational model. This model assumes that situations are the main source of behavioral differences. Change the situation and you change the behavior. Thus, instead of seeing some people as honest, and some less than honest, honesty is a function of the situation, of how much gain is at stake, of whether the person might get away with something, and so on.
4. The interaction model. This model assumes that actual behavior is the result of an interaction between the person and the situation. Thus, a person can be influenced by a situation (a shy person speaking up forcefully when a matter of principle is at stake), but a person also chooses situations (preferring to stay home rather than going to a party) and influences situations (being the "hit of the party").
5. The phenomenological model. This model focuses on the individual's introspection (looking inward) and on internal, subjective experiences. Here the construct of "self-concept" is an important one.

SOME BASIC ISSUES

Self-rating scales. Assume we wanted to measure a person's degree of responsibility. We could do this in a number of ways, but one way would be to administer the person a personality test designed to measure responsibility; another way would be simply to ask the person, "How responsible are you?" and have them rate themselves on a simple 5-point scale, ranging from "highly responsible" to "not at all responsible." Two interesting questions can now be asked: (1) how do these two methods relate to each other - does the person who scores high on the scale of responsibility also score high on the self-rating of responsibility? and (2) which of these two methods is more valid - which scores, the personality inventory or the self-ratings, will correlate more highly with an external, objective criterion? (Note that basically we are asking the question: Given two methods of eliciting information, which is better?)

There is some evidence suggesting that, at least in some situations, self-ratings tend to be the better method, that self-ratings turn out to be slightly more valid than corresponding questionnaire scales. The difference between the two methods is not particularly large, but has been found in a number of studies (e.g., M. D. Beck & C. K. Beck, 1980; Burisch, 1984; Carroll, 1952; Shrauger & Osberg, 1981). Why then use a test? In part, because the test parallels a hypothesized dimension, and allows us to locate individuals on that dimension. In essence, it's like asking people how tall they are. The actual measurement (5 feet 8 inches) is more informative than the rating "above average."

Self-report measures. One of the most common ways of assessing personality is to have the individual provide a report of their own behavior. The report may be a response to an open-ended question (tell me about yourself), may require selecting self-descriptive adjectives from a list, or answering true-false to a series of items. Such self-report measures assume, on the one hand, that individuals are probably in the best position to report on their own behavior. On the other hand, most personality assessors do not blindly assume that if the individual answers true to the item "I am a sociable person," the person is in fact sociable. It is the pattern of responses related to empirical criteria that is important. In fact, some psychologists (e.g., Berg, 1955; 1959) have argued that the content of the self-report is irrelevant; what is important is whether the response deviates from the norm.

Whether such reporting is biased or unbiased is a key issue that in part involves a philosophical issue: Are most people basically honest and objective when it comes to describing themselves?
Obviously, it depends. At the very least, it depends on the person and on the situation; some people are more insightful than others about their own behavior, and in some situations, some people might be more candid than others in admitting their shortcomings. Many self-report techniques, especially personality inventories, have incorporated within them some means of identifying the extent to which the respondent presents a biased picture; these are called validity scales because they are designed to tell us whether the measurement is valid or distorted. Some of these techniques are discussed in Chapter 16.

Projective measures. One common type of self-report is the personality inventory that consists of a number of scales with the items printed together, typically in a random sequence. These are often called objective personality tests because the scoring of the items and the meaning assigned to specific scores are not arbitrary. In contrast, there are a number of techniques called projective techniques that involve the presentation of an ambiguous set of stimuli, such as inkblots, sentence stems, or pictures, to which the respondents must impose some structure that presumably reflects their own personality and psychological functioning. Because these techniques are used more extensively in the clinic, we discuss them in Chapter 15.

Rating scales. Rating scales typically consist of a variable to be rated, for example, "leadership potential," and a set of anchor points from which the rater selects the most appropriate (e.g., low, average, or high). Rating scales can be used to assess a wide variety of variables, not just personality dimensions. Because ratings are quite often used in occupational settings, for example a manager rating employees, we discuss ratings in Chapter 14.

Situational methods. Sometimes, the personality of an individual can be assessed through direct observation of the person in a specific situation. In self-report, the person has presumably observed his or her behavior in a large variety of situations. In ratings, the observer rates the person based again on a range of situations, although the range is somewhat more restricted. In situational methods, the observation is based on a specific situation, which may extend over a time period, and may be natural (observing children on a playground), or contrived (bringing several managers together in a leaderless discussion). Interviews might be considered an example of situational methods, and these are discussed in Chapter 18.

Behavioral assessment. Most of the categories listed above depend on the assumption that what is being reported on or rated is a trait, a theoretical construct that allows us to explain behavior. Some psychologists argue that such explanatory concepts are not needed, that we can focus directly on the behavior. Thus behavioral assessment involves direct measures of behavior, rather than of such constructs as anxiety, responsibility, or flexibility. We discuss this concept and its applications in Chapter 18.

Other approaches. There are, of course, many ways of studying personality other than through the administration of a personality inventory. A wide variety of procedures have been used, some with moderate success, ranging from the study of eye pupil dilation and constriction (E. H. Hess, 1965; E. H. Hess & Polt, 1960), the study of head and body cues (Ekman, 1965), hand movement (Krout, 1954), voice characteristics (Mallory & Miller, 1958), and of course handwriting or graphology (Fluckiger, Tripp, & Weinberg, 1961).

Traits and Types

Two terms are often used in discussing personality, particularly in psychological testing. When we assess an individual with a personality test, that test will presumably measure some variable or combination of variables - perhaps sociability, introversion-extraversion, self-control, assertiveness, nurturance, responsibility, and so on. Ordinarily, we assume that individuals occupy different positions on the variable, that some people are more responsible than others, and that our measurement procedure is intended to identify with some degree of accuracy a person's position on that variable. The variable, assumed to be a continuum, is usually called a trait. (For an excellent discussion of trait see Buss, 1989.)

As you might expect, there is a lot of argument about whether such traits do or do not exist,
whether they reside in the person's biochemistry or are simply explanatory constructs, whether they are enduring or transitory, whether the concept of trait is even useful, and to what degree traits are found in the person or in the interaction between person and environment (e.g., R. B. Cattell, 1950; Hogan, DeSoto, & Solano, 1977; Holt, 1971; Mischel, 1968; 1977). In the 1960s and 1970s, the notion of personality traits came under severe attack (e.g., D'Andrade, 1965; Mischel, 1968; Mulaik, 1964; Ullman & Krasner, 1975), but it seems to have reemerged recently (e.g., Block, Weiss, & Thorne, 1979; Goldberg, 1981; Hogan, 1983; McCrae & Costa, 1985). McCrae and Costa (1986) point out that the trait approach, attacked so vigorously, has survived because it is based on the following set of assumptions that are basically valid:

1. Personality is generally marked by stability and regularity (Epstein, 1979).
2. Personality is relatively stable across the age span; people do change, but rarely are changes dramatic (Block, 1981).
3. Personality traits do predict behavior (Small, Zeldin, & Savin-Williams, 1983).
4. These traits can be assessed with a fair degree of accuracy both by self-reports and by ratings (McCrae & Costa, 1987).

A type is a category of individuals all of whom presumably share a combination of traits. Most psychologists prefer to think of traits as distributed along a normal curve model, rather than as dichotomous or multiple types. Thus, we think of people as differing in the degree of honesty, rather than there being two types of people, honest and dishonest. However, from a theoretical point of view, a typology may be a useful device to summarize and categorize behavior. Thus, most typologies stem from theoretical frameworks that divide people into various categories, with the full understanding that "pure" types probably do not exist, and that the typology is simply a convenient device to help us understand the complexity of behavior. One of the earliest typologies was developed by the Greeks, specifically Hippocrates and Galen, and was based on an excess of body "humors" or fluids: Thus, there were individuals who were melancholic (depressed) due to too much dark bile, sanguine (buoyant) due to too much blood, choleric (irritable) due to too much yellow bile, and phlegmatic (apathetic) due to too much phlegm.

TYPES OF PERSONALITY TESTS

The internal consistency approach. As discussed in Chapter 2, there are a number of ways of constructing tests, and this is particularly true of personality tests. One way to develop tests, sometimes called the method of internal consistency or the inductive method, is to use statistical procedures such as factor analysis. Basically, the method is to administer a pool of items to a sample or samples of individuals, and to statistically analyze the relationship of responses to the items to determine which items go together. The resulting set of variables presumably identifies basic factors. One of the pioneers of this approach was J. P. Guilford, who developed the Guilford-Martin Inventory of Factors and the Guilford-Zimmerman Temperament Survey (Guilford, 1959b). In this approach, the role of theory is minimal. While the author's theory may play a role in the formation of the initial pool of items, and perhaps in the actual naming of the factors, and in what evidence is sought to determine the validity of the test, the items are assigned to specific scales (factors) on the basis of statistical properties. A good example of this approach is the 16 Personality Factors Inventory (16PF), described later in this chapter.

The theoretical approach. A second method of test construction is called the theoretical or deductive method. Here the theory plays a paramount role, not just in the generation of an item pool, but in the actual assignment of items to scales, and indeed in the entire enterprise. We look at three examples of this approach: the Myers-Briggs Type Indicator (MBTI), the Edwards Personal Preference Schedule (EPPS), and the Personality Research Form (PRF).

Criterion-keying. A third approach is that of empirical criterion-keying, sometimes called the method of contrasted groups, the method of criterion groups, or the external method (Goldberg, 1974). Here the pool of items is administered to one or more samples of individuals, and criterion information is collected. Items that correlate
significantly with the criterion are retained. Often the criterion is a dichotomy (e.g., depressed vs. nondepressed; student leader vs. not-a-leader), and so the contrasted groups label is used. But the criterion may also be continuous (e.g., GPA, ratings of competence, etc.). Presumably the process could be atheoretical because whether an item is retained or not is purely an empirical matter, based on observation rather than predilection. The basic emphasis of this empirical approach is validity-in-use. The aim is to develop scales and inventories that can forecast behavior and that will identify people who are described by others in specific ways. Empiricists are not tied to any one particular method or approach, but rather seek what is most appropriate in a particular situation. The outstanding example of a criterion-keyed inventory is the California Psychological Inventory (CPI) discussed in this chapter.

The fiat method. Finally, a fourth approach, which we identified as the fiat method, is also referred to as the rational or logical approach, or the content validity approach. Here the test author decides which items are to be incorporated in the test. The first psychologists who attempted to develop personality tests assumed that such tests could be constructed simply by putting together a bunch of questions relevant to the topic and that whatever the respondent endorsed was in direct correspondence to what they did in real life. Thus a measure of leadership could be constructed simply by generating items such as, "I am a leader," "I like to be in charge of things," "People look to me for decisions," etc. Few such tests now exist because psychology has become much more empirical and demanding of evidence, and because many of the early personality tests built using this strategy were severely criticized (Landis, 1936; Landis, Zubin, & Katz, 1935). Quite often, "tests" published in popular magazines are of this type. There is, of course, nothing inherently wrong with a rational approach. It makes sense to begin rationally and to be guided by theory, and a number of investigators have made this approach their central focus (e.g., Loevinger, 1957). Perhaps it should be pointed out that many tests are the result of combined approaches, although often the author's "bias" will be evident.

Importance of language. Why select a particular variable to measure? Although, as we said, measurement typically arises out of need, one may well argue that some variables are more important than others, and that there is a greater argument for scaling them. At least two psychologists, Raymond Cattell and Harrison Gough, well known for their personality inventories, have argued that important variables that reflect significant individual differences become encoded in daily language. If, for example, responsibility is of importance, then we ought to hear lay people describe themselves and others in terms of responsibility, dependability, punctuality, and related aspects. To understand what the basic dimensions of importance are, we need to pay attention to language because language encodes important experiences.

Psychopathology. Many theories of personality and ways of measuring personality were originally developed in clinical work with patients. For example, individuals such as Freud, Jung, and Adler contributed to much of our understanding about basic aspects of personality, but their focus was primarily on psychopathology or psychological disturbances. Thus, there is a substantial area of personality assessment that focuses on the negative or disturbed aspects of personality; the MMPI is the most evident example of a personality inventory that focuses on psychopathology. These instruments are discussed in Chapter 15.

Self-actualization. Other theorists have focused on individuals who are unusually effective, who perhaps exhibit a great deal of inventiveness and originality, who are self-fulfilled or self-actualized. One example of such a theorist is Abraham Maslow (1954; 1962). We look at tests that might fall under this rubric in Chapter 8.

Focus on motivation. One of the legacies of Freud is the focus on motivation - on what motivates people, and on how these motives can be assessed. Henry A. Murray (1938) was one individual who both theoretically and empirically focused on needs, those aspects of motivation that result in one action rather than another (skipping lunch to study for an exam). Murray realized that the physical environment also impinges on
behavior, and therefore we need to focus on the environmental pressures or press that are exerted on the person. Both the EPPS and the PRF were developed based on Murray's theory.

EXAMPLES OF SPECIFIC TESTS

The Cattell 16PF

Introduction. How many words are there in the English language that describe personality? As you might imagine, there are quite a few such words. Allport and Odbert (1936) concluded that these words could actually be reduced to 4,504 traits. R. B. Cattell (1943) took these traits and through a series of procedures, primarily factor analysis, reduced them to 16 basic dimensions or source traits. The result was the Sixteen Personality Factor Questionnaire, better known as the 16PF (R. B. Cattell, A. K. Cattell, & H. E. Cattell, 1993).

Development. The 16PF was developed over a number of years with a variety of procedures. The guiding theory was the notion that there were 16 basic dimensions to personality, and that these dimensions could be assessed through scales developed basically by factor analysis. A great deal of work went into selecting items that not only reflected the basic dimensions, but that would be interesting for the subject and not offensive. Each factor was initially given a letter name, and descriptive names were not assigned for a number of years, in part because R. B. Cattell felt that these descriptive labels are quite limited and often people assign them meanings that were not necessarily there to begin with. In fact, as you will see, when R. B. Cattell named the factors he typically used descriptive labels that are not popular.

Description. The 16PF is designed for ages 16 and older, and yields scores for the 16 dimensions listed in Table 4.1.

Table 4-1. The dimensions of the 16PF

    Factor   Factor name                           Brief explanation
    A        Schizothymia-affectothymia            Reserved vs. outgoing
    B        Intelligence
    C        Ego strength                          Emotional stability
    E        Submissiveness-dominance
    F        Desurgency-surgency                   Sober-enthusiastic
    G        Superego strength                     Expedient-conscientious
    H        Threctia-Parmia                       Shy-uninhibited
    I        Harria-Premsia                        Tough minded vs. tender minded
    L        Alaxia-protension                     Trusting-suspicious
    M        Praxernia-Autia                       Practical-imaginative
    N        Artlessness-shrewdness                Unpretentious-astute
    O        Untroubled adequacy-guilt proneness   Self-assured vs. worrying
    Q1       Conservative-radical
    Q2       Group adherence                       Joiner-self sufficient
    Q3       Self-sentiment integration            Undisciplined-controlled
    Q4       Ergic tension                         Relaxed-tense

As shown in Table 4.1, each of the factors is identified by a letter, and then by a factor name. These names may seem quite strange, but they are the words that R. B. Cattell chose. For those names that are not self-evident, there is also a brief explanation in more familiar terms.

Six forms of the test are available, two of which are designed for individuals with limited education. The forms of the 16PF contain 187 items and require about 45 to 60 minutes for administration (25-35 minutes for the shorter forms of 105 items). Since its original publication, the 16PF has undergone five revisions. Some forms of the 16PF contain validity scales, scales that are designed to assess whether the respondent is producing a valid protocol, i.e., not faking. These scales include a "fake-bad" scale, a "fake-good" scale, and a random-responses scale.

The 16 dimensions are said to be independent, and each item contributes to the score of only one scale. Each of the scales is made up of 6 to 13 items, depending on the scale and the test form. The items are 3-choice multiple-choice items, or perhaps more correctly, forced-choice options. An example of such an item might be: If I had some free time I would probably (a) read a good book, (b) go visit some friends, (c) not sure. R. B. Cattell, Eber, and Tatsuoka (1972) recommend that at least two forms of the test be administered to a person to get a more valid measurement, but in practice, this is seldom done.
Administration. The 16PF is basically a self-administered test, and requires minimal skills on the part of the examiner to administer; interpretation of the results is of course a different matter.

Scoring. Scoring of the 16PF can be done by hand or by machine, and is quite straightforward - each endorsed keyed response counts 1 point. As with most other tests that are hand-scored, templates are available that are placed on top of the answer sheet to facilitate such scoring. Raw scores on the 16PF are then converted to stens, a contraction of standard ten, where scores can range from 1 to 10, and the mean is fixed at 5.5; such conversions are done by using tables provided in the test manual, rather than by doing the actual calculations. Despite the strange names given to the 16 factors, for each of the scales, the test manual gives a good description of what a low scorer or a high scorer might be like as a person. A number of computer scoring services are available. These can provide not only a scoring of the scales and a profile of results, but also a narrative report, with some geared for specific purposes (for example, selection of law enforcement candidates).
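Although the manual's tables should be used in practice, the idea behind a sten can be sketched as a simple linear transformation: stens have a mean of 5.5 and a standard deviation of 2, and are truncated to the 1-to-10 range. A minimal Python illustration (ours; the scale mean and standard deviation shown are invented, not values from the 16PF manual):

    def raw_to_sten(raw, scale_mean, scale_sd):
        # linear conversion: mean 5.5, SD 2, truncated to 1..10
        z = (raw - scale_mean) / scale_sd
        sten = round(5.5 + 2 * z)
        return max(1, min(10, sten))

    # a hypothetical scale with a raw-score mean of 10 and SD of 3
    print(raw_to_sten(14, scale_mean=10.0, scale_sd=3.0))   # 8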
Reliability. An almost overwhelming amount of information on the 16PF can be found in the Handbook for the 16PF (R. B. Cattell, Eber, & Tatsuoka, 1970), in the test manual, in the professional literature, and in a variety of publications from the test publisher. Internal consistency of the scales is on the low side, despite the focus on factor analysis, and scales are not very reliable across different forms of the test - i.e., alternate form reliability (Zuckerman, 1985). Information about test-retest reliability, both with short intervals (2 to 7 days) and longer intervals (2 to 48 months), is available and appears adequate. The correlation coefficients range from the .70s and .80s for the brief interval, to the .40s and .50s for a 4-year interval. This is to be expected because test-retest reliability becomes lower as the interval increases, and in fact, a 4-year interval may be inappropriate to assess test stability, but more appropriate to assess amount of change.

Validity. The test manual gives only what may be called factorial validity, the correlation of scores on each scale with the pure factor the scale was designed to measure. These coefficients range from a low of .35 to a high of .94, with the majority of coefficients in the .60 to .80 range. The literature, however, contains a multitude of studies, many of which support the construct validity of the 16PF.

Norms. Three sets of norms are available, for high-school seniors, college students, and adults. These norms are further broken down into separate gender norms, and age-specific norms. These norms are based on more than 15,000 cases stratified according to U.S. census data. Thus, these were not simply samples of convenience; the data was gathered according to a game plan. R. B. Cattell, Eber, and Tatsuoka (1970) present norms for a very large number of occupational samples ranging from accountants to writers. For example, they tested a sample of 41 Olympic champion athletes. As a group, these individuals showed high ego strength (Factor C), high dominance (Factor E), low superego (Factor G), and an adventurous temperament (Factor H). Football players are described as having lower intelligence (Factor B), scoring lower on factor I (harria), factor M (praxernia), and factor Q2 (group adherence). In English, these players are described as alert, practical, dominant, action-oriented, and group-dependent.

Interesting aspects. Despite the fact that the 16PF scales were developed using factor analysis and related techniques designed to result in independent measures, the 16 scales do correlate with each other, some rather substantially. For example, Factors O and Q4 correlate +.75; factors G and Q3, +.56; and factors A and H, +.44, just to cite some examples (R. B. Cattell, Eber, & Tatsuoka, 1970, p. 113). Thus, there seems to be some question whether the 16 scales are indeed independent (Levonian, 1961).

In addition to the 16 primary traits, other primary traits have been developed (at least 7) but have not been incorporated into the 16PF. In addition, factor analysis of the original 16 primary traits yields a set of 8 broader secondary traits. The 16PF can be scored for these secondary traits, although hand scoring is somewhat cumbersome.

The 16PF has also resulted in a whole family of related questionnaires designed for use with
children, adolescents, and clinical populations (e.g., Delhees & R. B. Cattell, 1971).

A number of investigators have applied the 16PF to cross-cultural settings, although more such studies are needed (M. D. Gynther & R. A. Gynther, 1976).

A substantial amount of research with the 16PF has been carried out, primarily by R. B. Cattell, his colleagues, and students. One of the intriguing areas has been the development of a number of regression equations designed to predict a variety of criteria, such as academic achievement and creativity.

Criticisms. The 16PF has been available for quite some time and has found extensive applications in a wide variety of areas. Sometimes, however, there has been little by way of replication of results. For example, the 16PF Handbook presents a number of regression equations designed to predict specific behaviors, but most of these regression equations have not been tested to see if they hold up in different samples.

The short forms are of concern because each scale is made up of few items, and short scales tend to be less reliable and less valid. In fact, the data presented by the test authors substantiate this concern, but a new test user may not perceive the difference between short forms and long forms in reliability and validity.

The Myers-Briggs Type Indicator (MBTI)

Introduction. Jung's theory and writings have had a profound influence on psychology, but not as much in the area of psychological testing. With some minor exceptions, most efforts in psychological testing stemming from Jungian theory have focused on only one concept, that of extraversion-introversion. The MBTI is unique in that it attempts to scale some important concepts derived from Jungian theory. The MBTI is a self-report inventory designed to assess Jung's theory of types. Jung believed that what seems to be random variation in human behavior is in fact orderly and consistent and can be explained on the basis of how people use perception and judgment. Perception is defined as the processes of becoming aware - aware of objects, people, or ideas. Judgment is defined as the processes of coming to conclusions about what has been perceived.

One basic question, then, is whether an individual tends to use perception or judgment in dealing with the world. Perception is composed of sensing, becoming aware directly through the senses of the immediate and real experiences of life, and of intuition, which is indirect perception by way of unconscious rather than conscious processes, becoming aware of possibilities and relationships. Judgment is composed of thinking, which focuses on what is true and what is false, on the objective and impersonal, and of feeling, which focuses on what is valued or not valued, what is subjective and personal. Finally, there is the dimension of extraversion or introversion; the extraverted person is oriented primarily to the outer world and therefore focuses perception and judgment upon people and objects. The introverted person focuses instead on the inner world, the world of ideas and concepts.

The manner in which a person develops is a function of both heredity and environment, as they interact with each other in complex ways, although Jung seemed to favor the notion of a predisposition to develop in a certain way (Jung, 1923). Once developed, types are assumed to be fairly stable.

Type is conceived to be categorical, even though the extent to which a person has developed in a particular way is continuous; a person is seen, in this schema, as being either sensing or intuitive, either thinking or feeling (Stricker & Ross, 1964a; 1964b).

Development. The MBTI was developed by Katharine Cook Briggs and her daughter, Isabel Briggs Myers. Myers in 1942 began to develop specific items for possible use in an inventory. From 1942 to about 1957, she developed a number of scales, did major pilot testing, and eventually released the MBTI. In 1962, Educational Testing Service published form F for research use only. In 1975, Consulting Psychologists Press took over the publication of form F and in 1977 published form G, both for professional use. In 1975 also, a center for the MBTI was opened at the University of Florida in Gainesville.

Description. The MBTI is geared for high-school students, college students, and adults. Form G
of the MBTI consists of some 126 forced-choice items of this type: Are you (a) a gregarious person; (b) a reserved and quiet person, as well as a number of items that ask the respondent to pick one of two words on the basis of appeal - for example: (a) rational or (b) intuitive.

Form F consists of 166 items, and there is an abbreviated form of 50 items (form H). There is also a self-scorable short form composed of 94 items from form G. This form comes with a two-part answer sheet that allows the respondent to score the inventory. Finally, there is a form designed for children, as well as a Spanish version.

There are thus four scales on the MBTI:

    Extraversion-introversion   abbreviated as E-I
    Sensation-intuition         abbreviated as S-N
    Thinking-feeling            abbreviated as T-F
    Judging-perceiving          abbreviated as J-P

Administration. Like the 16PF, the MBTI can be easily administered, and simply requires the subject to follow the directions. There is no time limit, but the MBTI can be easily completed in about 20 to 30 minutes. The MBTI requires a seventh-grade reading level.

Scoring. Although continuous scores are obtained by summing the endorsed keyed responses for each scale, individuals are characterized as to whether they are extraverted or introverted, sensation type or intuition type, etc., by assigning the person to the highest score in each pair of scales. Preferences are designated by a letter and a number to indicate the strength of the preference - for example, if a person scores 26 on E and 18 on I, that person's score will be "8E"; however, typically the letter is considered more important than the number, which is often disregarded. The MBTI does not try to measure individuals or traits, but rather attempts to sort people into types. There are thus 16 possible types, each characterized by a four-letter acronym, such as INTJ or ISTP.
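The scoring logic just described is easy to express in code. The following sketch is our illustration, not the publisher's scoring key, and the scores fed into it are invented:

    def mbti_type(scores):
        # For each pair of scales, the higher score determines the
        # letter; the difference is the (often disregarded) strength.
        pairs = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]
        letters = ""
        strengths = []
        for a, b in pairs:
            letters += a if scores[a] >= scores[b] else b
            strengths.append(abs(scores[a] - scores[b]))
        return letters, strengths

    # 26 on E and 18 on I gives the "8E" of the text's example
    code, diffs = mbti_type({"E": 26, "I": 18, "S": 10, "N": 15,
                             "T": 12, "F": 9, "J": 7, "P": 11})
    print(code, diffs)   # ENTP [8, 5, 3, 4]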
Interesting aspects. Jungian theory has always
Reliability. Alpha coefficients and split-half reli- had wide appeal to clinicians, and so the MBTI
abilities are given in the test manual (I. B. Myers has found quite a following with counselors, ther-
& McCaulley, 1985), while test-retest reliability apists, motivational consultants, and others who
studies have been reported in the literature (e.g., work directly with clients. In fact, it has become
Carlyn, 1977; Stricker & Ross, 1964a; 1964b). In somewhat of a “cult” instrument, with a small but
general, the results suggest adequate reliability, enthusiastic following, its own center to continue
the work of Isabel Myers Briggs, and its own journal named Research in Psychological Type.

The MBTI manual (I. B. Myers & McCaulley, 1985) gives considerable information for the psychometrically oriented user, but it is clear that the focus of the Manual is on the applied use of the MBTI with individual clients in situations such as personal and/or career counseling. Thus, there are detailed descriptions of the 16 pure types in terms of what each type is like, and there are presumed "employment aspects" for each type; for example, introverts are said to be more careful with details, to have trouble remembering names and faces, and to like to think before they act, as opposed to extroverts, who are faster, good at greeting people, and usually act quickly.

Nevertheless, we can still ask some "psychometric" questions, and one of these is: How independent are the four sets of scales? Intercorrelations of the four scales indicate that three of the scales are virtually independent, but that JP correlates significantly with SN, with typical correlation coefficients ranging from about .26 to .47; one way to interpret this is that intuitive types are more common among perceptive types - the two tend to go together.

Criticisms. One basic issue is how well the test captures the essence of the theory. Jungian theory is complex and convoluted, the work of a genius whose insights into human behavior were not expressed as easily understood theorems. The MBTI has been criticized because it does not mirror Jungian theory faithfully; it has also been criticized because it does, and therefore is of interest only if one accepts the underlying theory (see McCaulley, 1981, and J. B. Murray, 1990, for reviews).

The Edwards Personal Preference Schedule (EPPS)

Introduction. There are two theoretical influences that resulted in the creation of the EPPS. The first is the theory proposed by Henry Murray (1938) which, among other aspects, catalogued a set of needs as primary dimensions of behavior - for example, need achievement, need affiliation, need heterosexuality. These sets of needs have been scaled in a number of instruments such as the EPPS, the Adjective Check List (Gough & Heilbrun, 1965), and the Thematic Apperception Test (H. A. Murray, 1943). A second theoretical focus is the issue of social desirability. A. L. Edwards (1957b) argued that a person's response to a typical personality inventory item may be more reflective of how desirable that response is than the actual behavior of the person. Thus a true response to the item, "I am loyal to my friends," may be given not because the person is loyal, but because the person perceives that saying "true" is socially desirable.

Development. A. L. Edwards developed a pool of items designed to assess 15 needs taken from H. A. Murray's system. Each of the items was rated by a group of judges as to how socially desirable endorsing the item would be. Edwards then placed together pairs of items that were judged to be equivalent in social desirability, and the task for the subject was to choose one item from each pair.

Description. Each of the scales on the EPPS is then composed of 28 forced-choice items, where an item to measure need Achievement, for example, is paired off with items representative of each of the other 14 needs, and this is done twice per comparison. Subjects choose from each pair the one statement that is more characteristic of them, and the chosen underlying need is given one point. Let's assume for example, that these two statements are judged to be equal in social desirability:

Which of these is most characteristic? (a) I find it reassuring when friends help me out; (b) It is easy for me to do what is expected.

If you chose statement (a) you would receive one point for need Succorance; if you chose statement (b) you would receive a point for need Deference.
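Because each choice simply credits one need, the scoring itself is a tally. A minimal sketch (ours, not from the manual; the need labels are those of Table 4.2):

    from collections import Counter

    def score_epps(chosen_needs):
        # chosen_needs: for each forced-choice item, the need
        # represented by the statement the subject endorsed
        return Counter(chosen_needs)

    scores = score_epps(["Succorance", "Deference", "Succorance"])
    print(scores["Succorance"], scores["Deference"])   # 2 1

Note that the 15 scores necessarily sum to the number of items answered, which is precisely what makes the measurement ipsative, as the text goes on to explain.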
Note again, that this procedure of having to choose (a) vs. (b) results in ipsative measurement; the resulting score does not reflect the strength of a need in any "absolute" manner, but rather whether that need was selected over the other needs. Why is this point important? Suppose you and a friend enter a restaurant and find five choices on the menu: hamburger, salad, fishsticks, taco, and club sandwich. You may not care very much for any of those, but you select a hamburger because it seems the most palatable.
Your friend, however, simply loves hamburgers, and his selection reflects this. Both of you chose hamburgers but for rather different reasons. We should not assume that both of you are "hamburger lovers," even though your behavior might suggest that. Similarly, two people might score equally high on need aggression, but only one of them might be an aggressive individual.

In terms of the classificatory schema we developed in Chapter 1, the EPPS, like most other personality inventories, is commercially available, a group test, a self-report paper-and-pencil inventory, with no time limit, designed to assess what the subject typically does, rather than maximal performance.

The EPPS is designed primarily for research and counseling purposes, and the 15 needs that are scaled are presumed to be relatively independent normal personality variables. Table 4.2 gives a list of the 15 needs assessed by the EPPS.

Table 4–2. The EPPS Scales

Need                 Brief definition
1. Achievement       To achieve, to be successful
2. Deference         To follow, to do what is expected
3. Order             To be orderly and organized
4. Exhibition        To be at the center of attention
5. Autonomy          To be independent
6. Affiliation       To have friends
7. Intraception      To analyze one's self and others
8. Succorance        To be helped by others
9. Dominance         To be a leader
10. Abasement        To accept blame
11. Nurturance       To show affection and support
12. Change           To need variety and novelty
13. Endurance        To have persistence
14. Heterosexuality  To seek out members of the opposite sex
15. Aggression       To be aggressive, verbally and/or physically

Administration. The EPPS is easy to administer and is designed to be administered within the typical 50-minute class hour. There are two answer sheets available, one for hand scoring and one for machine scoring.

Reliability. The test manual gives both internal consistency (corrected split-half coefficients based on a sample of 1,509 subjects) and test-retest coefficients (1-week interval, n = 89); the corrected split-half coefficients range from +.60 for the need Deference scale to +.87 for the need Heterosexuality scale. The test-retest coefficients range from +.74 for need Achievement and need Exhibition to +.88 for need Abasement.

Validity. The test manual presents little data on validity, and many subsequent studies that have used the EPPS have assumed that the scales were valid. The results do seem to support that assumption, although there is little direct evidence of the validity of the EPPS.

Norms. Because the EPPS consists of ipsative measurement, norms are not appropriate. Nevertheless, they are available and widely used – although, many would argue, incorrectly. The initial normative sample consisted of 749 college women and 760 college men enrolled in various universities. The subjects were selected to yield approximately equal representation of gender and as wide an age spread as possible, as well as different majors. Basically then, the sample was one of convenience and not random or stratified. The manual also gives a table that allows raw scores to be changed into percentiles. Subsequently, the revised manual also gives norms for 4,031 adult males and 4,932 adult females who were members of a consumer purchase panel participating in a market survey. These norms are significantly different from those presented for college students; part of the difference may be that the adult sample is somewhat more representative of the general population.

Interesting aspects. The EPPS contains two validity indices designed to assess whether a particular protocol is valid or not. The first index is based on the fact that 15 items are repeated; the responses to these items are compared and a consistency score is determined. If the subject answers at least 11 of the 15 sets consistently, then it is assumed that the subject is not responding randomly. Interestingly, in the normative sample of 1,509 college students, 383 (or 25%) obtained scores of 10 or below.
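The arithmetic behind this consistency index is simple enough to sketch in code. The sketch below is illustrative only: the scoring rule (11 or more consistent pairs out of 15) comes from the description above, but the item numbers in the pair list are hypothetical, since the actual EPPS pairings are part of the published test.

    # Illustrative sketch of the EPPS consistency index described above.
    # The repeated-item pairs shown are hypothetical placeholders; the
    # real pairings are defined by the published inventory.

    HYPOTHETICAL_PAIRS = [(1, 151), (16, 166), (31, 181)]  # ...15 pairs in all

    def consistency_score(responses, pairs):
        """Count repeated-item pairs answered the same way.

        `responses` maps an item number to the statement chosen ('A' or 'B').
        """
        return sum(1 for first, second in pairs
                   if responses[first] == responses[second])

    def protocol_seems_valid(responses, pairs, cutoff=11):
        # 11 or more consistent pairs out of 15 is taken to mean the
        # respondent is not answering randomly.
        return consistency_score(responses, pairs) >= cutoff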
The second validity index, an index of profile stability, is obtained by correlating partial scores for each scale (based on 14 items) with the other 14 items. A correlation coefficient of at least +.44 across scales is assumed to indicate profile stability, and in fact 93% of the normative sample scored at or above this point. The calculation of this coefficient, if done by hand, is somewhat involved, and few if any test users do this.
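By machine, however, the computation is trivial: total each half of every scale and correlate the two resulting 15-score profiles. A minimal sketch, assuming the two half-scores per scale have already been tallied (the input lists are hypothetical):

    # Sketch of the profile-stability index: correlate, across the 15
    # scales, each scale's total on one half of its items with its total
    # on the other half. Both lists are assumed to be ordered by scale.
    from statistics import mean

    def pearson_r(xs, ys):
        mx, my = mean(xs), mean(ys)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / (sxx * syy) ** 0.5

    def profile_is_stable(half_a, half_b, cutoff=0.44):
        # +.44 is the cutoff treated above as indicating a stable profile.
        return pearson_r(half_a, half_b) >= cutoff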
that there were four basic principles that guided
What about the equating of the items on social
the construction of the PRF:
desirability? Note first, that the equating was
done on the basis of group ratings. This does
1. Explicit and theoretically based definitions of
not guarantee that the items are equated for the
each of the traits;
individual person taking the test (Heilbrun &
Goodstein, 1961). Secondly, placing two “equal” 2. Selection of items from a large item pool, with
items together may in fact cause a shift in social more than 100 items per scale, with selection
desirability, so that one of the items may still be based on homogeneity of items;
seen as more socially desirable (McKee, 1972). 3. The use of procedures designed to eliminate
The 15 need scales are designed to be inde- or control for such response biases as social
pendent. A. L. Edwards (1959) gives a matrix of desirability;
correlations based on the normative sample of 4. Both convergent and discriminant validity
1,509 college students. Most of the correlation were considered at every stage of scale develop-
coefficients are low and negative, but this is due ment, rather than after the scale was developed.
to the nature of the test – the higher a person
scores on one need, the lower they must score In constructing the PRF, D. N. Jackson (1967)
on the other needs (if you select butter pecan ice used a series of steps quite similar to the ones
cream as your favorite flavor, other flavors must outlined in Chapter 2:
be ranked lower). The largest coefficient reported
is between need Affiliation and need Nurturance 1. Each of the traits (needs) was carefully studied
(r = .46). The generally low values do support in terms of available theory, research, etc.;
A. L. Edwards’ claim that the scales are relatively 2. A large pool of items was developed, with each
independent. item theoretically related to the trait;
3. These items were critically reviewed by two or
Criticisms. The criticisms of the EPPS are many; more professionals;
some are minor and can be easily overlooked, 4. Items were administered to more than a thou-
but some are quite major (e.g., Heilbrun, 1972; sand subjects, primarily college students;
McKee, 1972). The use of ipsative scores in a nor- 5. A series of computer programs were written
mative fashion is not only confusing but incor- and used in conducting a series of item analyses;
rect. The relative lack of direct validity evidence 6. Biserial correlations were computed between
can be changed but it hasn’t, even although the each item, the scale on which the item presum-
EPPS has been around for some time. In general, ably belonged, scales on which the item did not
the EPPS seems to be fading away from the testing belong, and a set of items that comprised a ten-
scene, although at one time it occupied a fairly tative social desirability scale;
central position. 7. Items were retained only if they showed a
higher correlation with the scale they belonged
to than any of the other scales;
The Personality Research Form (PRF)
8. Finally, items were retained for the final scales
Introduction. The PRF (Jackson, 1967) is that showed minimal relation to social desirabil-
another example of the theoretical approach ity, and also items were balanced for true or false
and shares with the EPPS its basis on the need as the keyed response.
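The sketch that follows is a loose illustration of those retention rules, not Jackson's actual programs; the data structures, and the .20 ceiling on the desirability correlation, are hypothetical stand-ins.

    # Hypothetical illustration of the PRF item-winnowing rules (steps 6-8).
    # `item_scale_r[item]` maps each scale name to that item's correlation
    # with the scale score; `desirability_r[item]` is the item's correlation
    # with the tentative social desirability scale. The 0.20 ceiling is an
    # invented stand-in, not a value from Jackson (1967).

    def retain_item(item, own_scale, item_scale_r, desirability_r,
                    max_desirability_r=0.20):
        correlations = item_scale_r[item]
        own_r = correlations[own_scale]
        other_rs = [r for scale, r in correlations.items()
                    if scale != own_scale]
        highest_with_own_scale = all(own_r > r for r in other_rs)      # step 7
        low_desirability = abs(desirability_r[item]) < max_desirability_r  # step 8
        return highest_with_own_scale and low_desirability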
The result of these steps is a set of scales that have high internal consistency and minimal overlap and are relatively free from the response biases of acquiescence and social desirability.

Description. When first published in 1967, the PRF consisted of two parallel 440-item forms (forms AA and BB) and two parallel 300-item forms (forms A and B). In 1974, a revised and simplified 352-item version (form E) was published, and in 1984, form G was published for use in business and industry.

The PRF is designed to focus on normal functioning, but its primary focus was personality research and, secondly, applied work in various settings such as educational and business settings. Its scales, 15 or 22 depending on the form, of which 12 are identical in name with those on the EPPS, basically focus on seven areas of normal functioning: (1) impulse expression and control, (2) orientation toward work and play, (3) degree of autonomy, (4) intellectual and aesthetic style, (5) dominance, (6) interpersonal orientation, and (7) test-taking validity.

The last area, test-taking validity, is composed of two scales, Desirability and Infrequency; the Desirability scale assesses social desirability, or the tendency to respond on the test desirably or undesirably. The Infrequency scale is designed to identify carelessness or other "nonpurposeful" responding, and consists of items for which there is a clear modal answer, such as "I am unable to breathe."

Administration. The PRF can be easily administered to large groups and has clear instructions. There are no time limits, and the short form can be easily completed in about an hour.

Scoring. Both hand scoring and machine scoring are available.

Reliability. Because the development of the PRF consisted of some steps designed to select items that correlated highly with total scale scores, one would expect the reliability of the PRF, at least as measured by internal consistency methods, to be high. D. N. Jackson (1967) does list the Kuder-Richardson coefficients for the 22 scales, but the coefficients are inflated because they are based on the best 40 items for each scale, while each scale is made up of 20 items. To be really correct, the reliability coefficients should either have been computed on 20-item scales or have been corrected by the Spearman-Brown formula. Despite this, the coefficients are quite acceptable, with the exception of the Infrequency scale.

Test-retest reliabilities are also presented for a sample of 135 individuals retested with a 1-week interval. Coefficients range from a low of +.46 (again for the Infrequency scale) to a high of .90, with more than half of the coefficients in the .80s range. Odd-even reliabilities are also presented, with slightly lower coefficients.
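The Spearman-Brown correction invoked above is a standard psychometric formula: if a test's length is changed by a factor k, the projected reliability is r' = kr / (1 + (k - 1)r). A minimal sketch; projecting a coefficient computed on 40 items down to a 20-item scale uses k = 0.5:

    def spearman_brown(reliability, length_factor):
        """Spearman-Brown prophecy formula: project the reliability of a
        test whose length is multiplied by `length_factor`."""
        k, r = length_factor, reliability
        return (k * r) / (1 + (k - 1) * r)

    # Example: a coefficient of .90 obtained on the best 40 items overstates
    # the reliability of a 20-item scale; halving the length (k = 0.5)
    # projects a more honest estimate.
    print(round(spearman_brown(0.90, 0.5), 2))  # 0.82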
Validity. D. N. Jackson (1967; 1984) presents considerable convergent validity data for the PRF. One set of studies consists of comparisons between PRF scores and ratings both by observers and by the subjects themselves on the same scales; correlation coefficients range from a low of +.10 to a high of +.80, with many of the coefficients in the .30 to .60 range. Correlations are also presented for PRF scales with scales of the Strong Vocational Interest Blank (SVIB) (most coefficients are quite low, as one would expect, because the SVIB measures career interests), and with the California Psychological Inventory (CPI), where high correlations are obtained where expected; for example, the PRF need Dominance scale and the CPI Dominance scale correlate +.78.

Norms. D. N. Jackson (1967) presents norms based on 1,029 males and 1,002 females, presumably college students.

Interesting aspects. The PRF has been hailed as a personality inventory that is very sophisticated in its development. Although it has been available for some time, it is not really a popular test, especially among practitioners. For example, Piotrowski and Keller (1984) asked all graduate programs that train doctoral students in clinical psychology which tests a clinical PhD candidate should be familiar with. The PRF was mentioned by only 8% of those responding.

The test manual does not make it clear why both short forms and long forms of the PRF were developed. Strictly speaking, these are not short forms but abbreviated forms that assess only 15 of the 22 scales. The parallel forms
represent a potential plus, although in personality assessment there are probably few occasions where alternate forms might be useful. In addition, the revised version (form E) apparently does not have a parallel form. As with most multivariate instruments, the PRF has been subjected to factor analysis (see P. C. Fowler, 1985; D. N. Jackson, 1970).

Criticisms. Hogan (1989a) and Wiggins (1989) reviewed the PRF and they, like other reviewers, cited a number of problems. Perhaps the major criticisms concern the lack of validity studies and of noncollege normative data. Both of these can be remedied, but it is somewhat surprising that they have not been, given that the PRF has now been available to researchers for some 30 years.

Another issue is the choice of Murray's needs as the variables that were scaled. Hogan (1989a) suggests that these variables were chosen because "they were there," rather than for intrinsic utility or theoretical preference. In short, as Hogan (1989a) suggests, despite the technical excellence of the PRF, the CPI or the MBTI may be more useful to the practitioner.

The California Psychological Inventory (CPI)

Introduction. In the survey of clinical psychology programs (Piotrowski & Keller, 1984) mentioned before, the most popular personality inventory mentioned was the MMPI, which was listed by 94% of the respondents. The second most popular was the CPI, which was mentioned by 49%. Thus, despite its focus on normality, the CPI is considered an important instrument by clinicians, and indeed it is. Surveys done with other professional groups similarly place the CPI in a very high rank of usefulness, typically second after the MMPI.

The author of the CPI, Harrison Gough, indicates (personal communication, August 3, 1993) that to understand the CPI there are five "axioms" or basic notions that need attention:

1. The first is the question, "what should be measured?" We have seen that for Edwards and for Jackson the answer lies in Murray's list of needs. For Gough the answer is folk concepts. Gough argues that across the ages, in all cultures, the important dimensions of behavior have become encapsulated in the language that people use to describe themselves, others, and behavior. These dimensions have survived the test of time, and do not reflect fads or ephemeral theories, but important dimensions of personality functioning that we, as social scientists, should pay attention to. These dimensions are labeled by Gough as folk concepts.

2. How many scales are needed in an inventory? In one sense, this is the question of how many basic dimensions of psychological functioning there are. Rather than provide a specific number, as many others do, Gough prefers the use of an open system that allows the development of new scales; or as Gough succinctly states, there should be "enough scales to do the job the inventory is intended to do." Some new scales for or on the CPI have been developed (e.g., Hakstian & Farrell, 2001), although nowhere near the large number of new MMPI scales.

3. How should the scales be conceptualized? Rather than take a factor analytic approach, Gough uses primarily the empirical method of criterion-keying and argues that the CPI scales are "instrumental" – that is, they have only two purposes: (a) to predict what people will say and do in specific contexts, and (b) to identify people who are described by others in specified ways (e.g., competent, friendly, leaders, etc.). There is nothing claimed here about the assessment of traits, or internal item homogeneity, or other traditional ways of thinking about personality assessment.

4. How should the scales relate to each other? Most psychologists would reply that the scales should be uncorrelated, even though the empirical evidence suggests that most "uncorrelated" scales do correlate. Gough argues that independence is a preference and not a law of nature, and he argues that the scales should correlate to the same degree as the underlying concepts do in everyday usage. If we tend to perceive leaders as more sociable, and indeed leaders are more sociable, then scores on a scale of leadership and on one of sociability should in fact correlate.

5. Should a domain of functioning be assessed by a single scale or by a set of scales? If we wanted to measure and/or understand the concept of "social class membership," would simply knowing a
person's income be sufficient, or would knowing their educational level, their occupation, their address, their involvement in community activities, and so on, enrich our understanding? Gough argues for the latter approach.

Development. The CPI, first published in 1956, originally contained 480 true-false items and 18 personality scales. It was revised in 1987 to 462 items with 20 scales. Another revision that contains 434 items was completed in 1995; items that were out of date or medically related were eliminated, but the same 20 scales were retained. The CPI is usually presented as an example of a strictly empirical inventory, but that is not quite correct. First of all, of the 18 original scales, 5 were constructed rationally, and 4 of these 5 were constructed using the method of internal consistency analysis (see Megargee, 1972, for details). Second, although 13 of the scales were constructed empirically, for many of them there was an explicit theoretical framework that guided the development; for example, the Socialization scale came out of a role theory framework. Finally, with the 1987 revision, there is now a very explicit theory of human functioning incorporated in the inventory.

Description. Table 4.3 lists the names of the current CPI scales, with a brief description of each.

Table 4–3. The 20 Folk-Concept Scales of the CPI

Class I scales: Measures of interpersonal style
  Do   Dominance
  Cs   Capacity for status
  Sy   Sociability
  Sp   Social presence
  Sa   Self-acceptance
  In   Independence
  Em   Empathy

Class II scales: Measures of normative orientation
  Re   Responsibility
  So   Socialization
  Sc   Self-control
  Gi   Good impression
  Cm   Communality
  Wb   Well-being
  To   Tolerance

Class III scales: Measures of cognitive functioning
  Ac   Achievement via conformance
  Ai   Achievement via independence
  Ie   Intellectual efficiency

Class IV scales: Measures of personal style
  Py   Psychological mindedness
  Fx   Flexibility
  F/M  Femininity/Masculinity

The 20 scales are arranged in four groups; these groupings are the result of logical analyses and are intended to aid in the interpretation of the profile, although the groupings are also supported by the results of factor analyses. Group I scales measure interpersonal style and orientation, and relate to such aspects as self-confidence, poise, and interpersonal skills. Group II scales relate to normative values and orientation, to such aspects as responsibility and rule-respecting behavior. Group III scales are related to cognitive-intellectual functioning. Finally, Group IV scales measure personal style.

The basic goal of the CPI is to assess those everyday variables that ordinary people use to understand and predict their own behavior and that of others – what Gough calls folk concepts. These folk concepts are presumed to be universal, found in all cultures, and therefore relevant to both personal and interpersonal behavior.

The CPI then is a personality inventory designed to be taken by a "normal" adolescent or adult person, with no time limit, but usually taking 45 to 60 minutes.

In addition to the 20 standard scales, there are currently some 13 "special purpose scales" such as, for example, a "work orientation" scale (Gough, 1985) and a "creative temperament" scale (Gough, 1992). Because the CPI pool of items represents an "open system," items can be eliminated or added, and new scales developed as the need arises (some examples are Hogan, 1969; Leventhal, 1966; Nichols & Schnell, 1963). Because the CPI scales were developed independently, but using the same item pool, there is some overlap of items; 42% of the items (192 out of 462) load on more than one scale, with most (127 of the 192) used in scoring on two scales, and 44 of the 192 items used on three scales.

The 1987 revision of the CPI also included three "vector" or structural scales, which taken together generate a theoretical model of
personality. The first vector scale, called "v1," relates to introversion-extraversion, while the second vector scale, "v2," relates to norm-accepting vs. norm-questioning behavior. A classification of individuals according to these two vectors yields a fourfold typology, as indicated in Figure 4.1.

FIGURE 4–1. The CPI vectors 1 and 2. Vector 1 runs from extraverted (involvement and participation) to introverted (detachment and privacy); vector 2 runs from rule accepting to rule questioning. The four quadrants: ALPHA (extraverted, rule accepting), characterized as ambitious, productive, high-aspiration level, a leader, has social poise, talkative, a doer, able to deal with frustration, can be self-centered; BETA (introverted, rule accepting), characterized as ethical, submissive, dependable and responsible, can be conformist, methodical, reserved, able to delay gratification; GAMMA (extraverted, rule questioning), characterized as a doubter and skeptic, innovative, self-indulgent, rebellious and nonconforming, verbally fluent; DELTA (introverted, rule questioning), characterized as tending to avoid action, feeling a lack of personal meaning, shy and quiet, reflective, focused on the internal world.

According to this typology, people can be broadly classified into one of four types: the alphas, who are typically leaders and doers, who are action oriented and rule respecting; the betas, who are also rule respecting, but are more reserved and benevolent; the gammas, who are the skeptics and innovators; and finally, the deltas, who focus more on their own private world and may be visionary or maladapted.

Finally, a third vector scale, "v3," was developed, with higher scores on this scale relating to a stronger sense of self-realization and fulfillment. These three vector scales, which are relatively uncorrelated with each other, lead to what Gough (1987) calls the cuboid model.

The raw scores on "v3" can be changed into one of seven different levels, from poor to superior, each level defined in terms of the degree of self-realization and fulfillment achieved. Thus the actual behavior of each of the four basic types is also a function of the level reached on "v3"; a delta at the lower levels may be quite maladapted and enmeshed in conflicts, while a delta at the higher levels may be highly imaginative and creative.

Administration. As with other personality inventories described so far, the CPI requires little by way of administrative skills. It can be administered to one individual or to hundreds of subjects at a sitting. The directions are clear and the inventory can typically be completed in 45 to 60 minutes. The CPI has been translated into a number of different languages, including Italian, French, German, Japanese, and Mandarin Chinese.

Scoring. The CPI can be scored manually through the use of templates or by machine. A number of computer services are available, including scoring of the standard scales, the vector scales, and a number of special purpose scales, as well as detailed computer-generated reports, describing with almost uncanny accuracy what the client is like.

The scores are plotted on a profile sheet so that raw scores are transformed into T scores. Unlike most other inventories, where the listing of the scales on the profile sheet is done alphabetically, the CPI profile lists the scales in order of their psychological relationship with each other, so that profile interpretation of the single case is facilitated. Also, each scale is keyed and graphed so that higher functioning scores all fall in the upper portion of the profile.
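T scores, referred to here and throughout this chapter, are a standard linear transformation: T = 50 + 10(X − M)/SD, where M and SD are the mean and standard deviation of the normative group. A brief sketch; the normative values in the example are invented for illustration, since the CPI's actual norms are given in the manual:

    def t_score(raw, norm_mean, norm_sd):
        """Linear T score: mean of 50 and SD of 10 in the normative group."""
        return 50 + 10 * (raw - norm_mean) / norm_sd

    # Hypothetical example: a raw score of 30 on a scale with a normative
    # mean of 24 and SD of 4 plots at T = 65, i.e., 1.5 SDs above the mean.
    print(t_score(30, norm_mean=24, norm_sd=4))  # 65.0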
Reliability. Both the CPI manual (Gough, 1987) and the CPI Handbook (Megargee, 1972) present considerable reliability information, too much to be easily summarized here. But as examples, let us look at the Well-Being scale, one of the more reliable scales, and at the Self-Acceptance scale, one of the less reliable scales. For the Well-Being scale, test-retest reliability coefficients of .73 and .76 are reported, as well as internal consistency coefficients ranging from .76 to .81, and a corrected split-half coefficient of .86. In contrast, for the Self-Acceptance scale, the test-retest reliability coefficients are .60 and .74, the internal
consistency coefficients range from .51 to .58, and the corrected split-half coefficient is .70.

Validity. There are a very large number of studies that have used the CPI and thus are relevant to the question of its validity. Megargee (1972) attempted to summarize most of the studies that appeared before 1972, but an even larger number of studies have appeared since then. Although Gough and his students have been quite prolific in their contributions to the literature, the CPI has found wide usage, as well as a few vociferous critics.

Because of space limitations, we cannot even begin to address the issue of validity, but perhaps one small example will suffice. Over the years, the CPI has been applied with outstanding success to a wide variety of questions of psychological import, including that of college entrance. Nationwide, only about 50% of high-school graduates enter college. Can we predict who will enter college? Intellectual aptitude is certainly one variable and indeed it correlates significantly with college entrance, but not overwhelmingly so; typical correlations between scores on tests of intellectual aptitude and entering or not entering college are in the range of .30 to .40. Socioeconomic status is another obvious variable, but here the correlations are even lower.

In the CPI test manual, Gough (1987) reports on a nationwide normative study in which 2,620 students took the CPI while in high school and were surveyed 5 to 10 years later as to their college-going. Overall, 40% of the sample attended college, but the rates were different for each of the four types, as defined by vectors 1 and 2. Alphas had the highest rate (62%), while deltas had the lowest rate (23%); both betas and gammas had rates of 37%. High-potential alphas (those scoring at levels 5, 6, or 7 on the "v3" scale) tended to major in business, engineering, medicine, and education, while high-potential deltas tended to major in art, literature, and music; note that because fewer deltas were entering college, there are fewer such talented persons in a college environment. Within each type, going to college was also significantly related to level of self-realization. For example, for the alphas, only 28% of those in level 1 went to college, but a full 78% of those in levels 5, 6, and 7 did. As Gough (1989) points out, the CPI has been applied to an incredibly wide range of topics, from studies of academic achievement in various settings and with various populations, to studies of criminal and delinquent behavior, studies of persons in varied occupations, creativity, intelligence, leadership, life span development, and so on. Recent studies that have looked at the revised CPI scales have found that such scales are as valid as the earlier versions and sometimes more so (e.g., DeFrancesco & Taylor, 1993; Gough & Bradley, 1992; Haemmerlie & Merz, 1991; Zebb & Meyers, 1993).

Norms. The CPI manual (Gough, 1987) contains very complete norms for a wide variety of samples, including a basic normative sample of 1,000 individuals, high school samples, college samples, graduate and professional school samples, occupational samples, and miscellaneous samples such as Catholic priests and prison inmates.

Interesting aspects. Gough (1987) argues that because all the CPI scales assess interpersonal functioning, positive correlations among the scales should be the rule rather than the exception, and indeed are proof that the CPI is working the way it was intended to. On the other hand, those of a factor analytic persuasion see such correlations as evidence that the scales are not pure measures. The data presented in the manual (Gough, 1987) do indeed show that the 20 folk-concept scales intercorrelate, some quite substantially and some to an insignificant degree. For example, at the high end, Tolerance and Achievement via Independence correlate +.81, while Dominance and Self-Acceptance correlate +.72. At the low end, Flexibility and Responsibility correlate +.05, and Femininity/Masculinity and Good Impression correlate +.02 (these coefficients are based on a sample of 1,000 males).

Given the 20 folk-concept scales and the fact that they intercorrelate, we can ask whether there are fewer dimensions on the CPI than the 20 represented by the scales, and indeed there are. Gough (1987) presents the results of a factor analysis, based on 1,000 males and 1,000 females, that indicates four factors:

1. The first factor is named extraversion and involves scales that assess poise, self-assurance, initiative, and resourcefulness.
2. The second factor is one of control, and is defined by scales that relate to social values and the acceptance of rules.

3. Factor 3 is called flexibility, and is defined by scales that assess individuality, ingenuity, and personal complexity.

4. Finally, the fourth factor is called consensuality, and is defined by scales that assess the degree to which a person sees the world as others do and behaves in accord with generally accepted principles, with what is accepted by consensus.

A more recent factor analysis of the CPI-R (Wallbrown & Jones, 1992) gives support both to the notion that there is one general factor of personal adjustment measured by the CPI, as well as to three additional factors that coincide well with Gough's clinical analysis of the three vectors.

Much more can be said about the CPI. Its manual contains much information aimed at the practitioner, including case reports. The CPI has found wide usage not just as a research instrument, but for career counseling (e.g., McAllister, 1986) and organizational planning (e.g., P. Meyer & Davis, 1992). We should also mention that three of the CPI scales are designed to detect invalid protocols, in addition to having personological implications. Perhaps more than any other personality inventory, the CPI has been used in a wide variety of cross-cultural studies.

We now look at some personality scales that, unlike the inventories discussed so far, are not well known or commercially available. They are, however, illustrative of what is currently available in the literature and of various approaches.

The Inventory of Psychosocial Balance (IPB)

Introduction. The IPB is based upon the developmental theory of Erik Erikson (1963; 1980; 1982), who postulated that life is composed of eight stages, each stage having a central challenge to be met. The eight stages and their respective challenges are presented in Table 4.4.

Table 4–4. The world according to Erikson

Life stage          Challenge to be met
Early infancy       Trust vs. mistrust
Later infancy       Autonomy vs. shame and doubt
Early childhood     Initiative vs. guilt
Middle childhood    Industry vs. inferiority
Adolescence         Identity vs. role confusion
Early adulthood     Intimacy vs. isolation
Middle adulthood    Generativity vs. stagnation
Late adulthood      Ego integrity vs. despair

Development. G. Domino and Affonso (1990) developed the IPB to assess these eight stages. They began by analyzing Erikson's writings and the related literature, and writing an initial pool of 346 items reflecting both positive and negative aspects of each stage. Unlike most other personality inventories that use a true-false format, the response format chosen was a 5-point scale, more formally known as a Likert scale (see Chapter 6), which gives the respondent five response choices: strongly agree, agree, uncertain, disagree, and strongly disagree. Each of the items was first presented to five psychologists familiar with Erikson's theory, who were asked to review the item for clarity of meaning and to identify which life stage the item addressed. Items that were judged not to be clear or were not identified correctly as to stage were eliminated. Those procedures left 208 items. These items were then administered to various samples, ranging from high-school students to adults living in a retirement community – a total of 528 subjects. Each person was also asked to complete a questionnaire that asked the respondent to rate, on a scale of 0 to 100%, how successfully he or she had met each of 19 life challenges, such as trusting others, having sufficient food, and being independent. Eight of these 19 challenges represented those in Erikson's life stages.

The 528 protocols were submitted to a factor analysis, and each item was correlated with each of the eight self-ratings of life challenges. The factor analysis indicated eight meaningful factors corresponding to the eight stages. Items for each of the eight scales were retained if they met three criteria:

1. The item should correlate the highest with its appropriate dimension – for example, a trust item should correlate the most with the trust dimension.
2. The item should correlate the most with corresponding self-ratings. A trust item should correlate the most with the self-rating of trusting others.

3. The obtained correlation coefficients in each case must be statistically significant.

Finally, for each scale, the best 15 items were selected, with both positively and negatively worded items to control for any response bias.

Description. The IPB is brief, with 120 items, and consists of a question sheet and a separate answer sheet. It is designed for adults, although it may be appropriate for adolescents as well. Table 4.5 gives the eight scales with some representative items for each scale.

Table 4–5. IPB factors and representative items

Factor          Representative item
Trust           I can usually depend on others
Autonomy        I am quite self-sufficient
Initiative      When faced with a problem, I am very good at developing various solutions
Industry        I genuinely enjoy work
Identity        Sometimes I wonder who I really am
Intimacy        I often feel lonely even when there are others around me
Generativity    Planning for future generations is very important
Ego integrity   Life has been good to me

Administration. The IPB can be easily administered to an individual or a group, with most subjects completing the instrument in less than 30 minutes.

Scoring. The eight scales can be easily scored by hand.

Reliability. The authors assessed three samples for reliability purposes: 102 college students; 68 community adults who were administered the IPB twice, with a test-retest period from 28 to 35 days; and a third sample of 73 adults living in a retirement community. The alpha coefficients for the first and third samples ranged from .48 to .79, acceptable but low. The authors interpreted these results as reflecting heterogeneity of item content. The test-retest coefficients for the second sample ranged from .79 to .90, quite high and indicative of substantial temporal stability, at least over a 1-month period.

Validity. The validity of a multivariate instrument is a complex endeavor, but there is some available evidence in a set of four studies by G. Domino and Affonso (1990). In the first study, IPB scores for a sample of 57 adults were correlated with an index of social maturity derived from the CPI (Gough, 1966). Six of the eight IPB scales correlated significantly and positively with the CPI social maturity index. Individuals who are more mature socially tend to have achieved the Eriksonian developmental goals to a greater degree. The two scales that showed nonsignificant correlation coefficients were the Autonomy scale and the Intimacy scale.

In a second study, 166 female college students were administered the IPB, their scores were summed across the eight scales, and the 18 highest-scoring and 18 lowest-scoring students were then assessed by interviewers who were blind as to the selection procedure. The high IPB scorers were seen as independent, productive, socially at ease, warm, calm and relaxed, genuinely dependable and responsible. The low IPB scorers were seen as self-defensive, anxious, irritable, keeping people at a distance, and self-dramatizing. In sum, the high scorers were seen as psychologically healthy people, while the low scorers were not. Incidentally, this study nicely illustrates part of secondary validity, which we discussed in Chapter 3.

You will recall also from Chapter 3 that to establish construct validity, both convergent and discriminant validity must be shown. The first two studies summarized above speak to the convergent validity of the IPB; a third study was carried out to focus on discriminant validity. For a sample of 83 adults, the IPB was administered together with a set of scales to measure variables such as social desirability and intelligence. A high correlation between an IPB scale and one of these scales might suggest that there is a nuisance component – that the scale in fact does not assess the relevant stage but is heavily influenced by, for example, intelligence. In fact, of the 48 correlations computed, only one achieved statistical significance, even though quite low (.29), and thus quite easily due to chance.
Finally, a fourth study is presented by the authors to show that within the IPB there are developmental trends in accord with Erikson's theory. For example, adolescents should score lower than the elderly, and the results partially support this.

Norms. Formal norms are not presently available on the IPB other than summary statistics for the above samples.

Interesting aspects. In a separate study (G. Domino & Hannah, 1989), the IPB was administered to 143 elderly persons who were participating in a college program. They were also assessed with the CPI self-realization scale (vector 3) as a global self-report of perceived effective functioning. For men, higher effective functioning was related to a greater sense of trust and industry and lower scores on generativity and intimacy. For women, higher effective functioning was related most to a sense of identity and to lower scores on trust and industry. These results suggest that for people who grew up in the 1920s and 1930s, there were different pathways to success – for men, success was facilitated by having basic trust, working hard, and not getting very close to others. For women, it meant developing a strong sense of identity, not trusting others, and not being as concerned with actual work output (see also Hannah, G. Domino, Figueredo, & Hendrickson, 1996).

Note that in developing the IPB the authors attempted to develop scales on the basis of both internal consistency and external validity.

Criticisms. The IPB is a new instrument and, like hundreds of other instruments that are published each year, may not survive rigorous analysis, or may simply languish on the library shelves.

The Self-Consciousness Inventory (SCI)

Introduction. "Getting in touch with oneself" or self-insight would seem to be an important variable, not just from the viewpoint of the psychologist interested in the arena of psychotherapy, for example, but also for the lay person involved in everyday transactions with the world. We all know individuals who almost seem obsessed with analyzing their own thoughts and those of others, as well as individuals who seem to be blessedly ignorant of their own motivation and the impact, or lack of it, they have on others.

Development. Fenigstein, Scheier, and Buss (1975) set about to develop a scale to measure such self-consciousness, which they defined as the consistent tendency of a person to direct his or her attention inwardly or outwardly. They first identified the behaviors that constitute the domain of self-consciousness, and decided that this domain was defined by seven aspects: (1) preoccupation with past, present, and future behavior; (2) sensitivity to inner feelings; (3) recognition of one's personal attributes, both positive and negative; (4) the ability to "introspect" or look inwardly; (5) a tendency to imagine oneself; (6) awareness of one's physical appearance; and (7) concern about the appraisal of others.

This theoretical structure guided the writing of 38 items, with responses ranging from extremely uncharacteristic (scored zero) to extremely characteristic (scored 4 points). These items were administered to undergraduate college students, 130 women and 82 men, whose responses were then factor analyzed. The results indicated three factors. This set of items was then revised a number of times, each time followed by a factor analysis, and each time a three-factor structure was obtained.

Description. The final version of the SCI consists of 23 items, with 10 items for factor 1, labeled private self-consciousness; 7 items for factor 2, labeled public self-consciousness; and 6 items for factor 3, labeled social anxiety. The actual items and their factor loadings are presented in the article by Fenigstein, Scheier, and Buss (1975). Examples of similar items are, for factor 1: "I am very aware of my mood swings"; for factor 2: "I like to impress others"; for factor 3: "I am uneasy in large groups."

Administration. This is a brief instrument, easily self-administered, and probably taking no longer than 15 minutes for the average person.

Scoring. Four scores are obtained, one for each of the three factors, and a total score which is the sum of the three factor scores.
Reliability. The test-retest reliability for a sample of 84 subjects over a two-week interval ranges from +.73 for Social Anxiety (the shortest scale) to +.84 for Public Self-Consciousness. The reliability of the total score is +.80. Note here that the total scale, which is longer than any of its subscales, is not necessarily the most reliable.

Validity. No direct validity evidence was presented in the original paper, but subsequent research supports its construct validity (e.g., Buss & Scheier, 1976; L. C. Miller, Murphy, & Buss, 1981).

Norms. The authors present means and SDs separately for college men (n = 179) and for college women (n = 253), both for the total scale and for the three subscales. The results seem to indicate no gender differences.

Interesting aspects. Interscale correlations are presented by the authors. The coefficients are small (from −.06 to +.26), but some are statistically significant. Thus public self-consciousness correlates moderately with both private self-consciousness and social anxiety, while private self-consciousness does not correlate significantly with social anxiety.

Note that the three factors do not match the seven dimensions originally postulated, and the authors do not indicate the relationship between the obtained factors and the hypothesized dimensions. Note also that the three subscales are scored by unitary weights; that is, each item is scored 0 to 4 depending on the keyed response that is endorsed. This is not only legitimate, but a quite common procedure. There is, however, at least one alternative scoring procedure, and that is to assign scoring weights on the basis of the factor loadings of the items, so that items that have a greater factor loading, and presumably measure "more" of that dimension, receive greater weight. For example, item 1 has a factor loading of .65 for factor 1, and could be scored .65 times 0 to 4, depending on the response choice selected. Item 5 has a loading of .73 for factor 1, and so could be scored .73 times 0 to 4, giving it a greater weight than item 1 in the subscale score. Clearly, this scoring procedure would be time consuming if the scoring were done by hand, but could be easily carried out by computer. Logically, this procedure makes sense. If an item measures more of a particular dimension, as shown by its larger factor loading, shouldn't that item be given greater weight? Empirically, however, this procedure of differential weighting does not seem to improve the validity of a scale. Various attempts have been made in the literature to compare various ways of scoring the same instrument, to determine whether one method is better. For an example of a study that compared linear vs. nonlinear methods of combining data, see C. E. Lunneborg and P. W. Lunneborg (1967).
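As a concrete illustration of the two scoring procedures just contrasted, consider the short sketch below; the loadings for items 1 and 5 (.65 and .73) are the examples cited above, while the third item and all of the response values are invented:

    # Unit weighting vs. factor-loading weighting for a subscale.
    # Responses are the 0-4 choices. The loadings for items 1 and 5 are
    # the examples cited in the text; item 9 and all response values are
    # invented for illustration.

    responses = {1: 3, 5: 4, 9: 2}           # item number -> response (0-4)
    loadings = {1: 0.65, 5: 0.73, 9: 0.58}   # item number -> factor loading

    unit_weighted = sum(responses.values())
    loading_weighted = sum(loadings[item] * value
                           for item, value in responses.items())

    print(unit_weighted)               # 9
    print(round(loading_weighted, 2))  # 0.65*3 + 0.73*4 + 0.58*2 = 6.03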
Criticisms. The initial pool of items was surprisingly small, especially in relation to the number of items that were retained, and so it is natural to wonder about the content validity of this test.

Boredom Proneness Scale (BP)

Introduction. The authors of this scale (Farmer & Sundberg, 1986) argue that boredom is a common emotion and one that is important not only in the overall field of psychology but also in more specialized fields such as industrial psychology, education, and drug abuse, yet few scales exist to measure this important variable.

Development. The authors began with a review of the relevant literature, as well as with interviews with various persons; this led to a pool of 200 true-false items, similar to, "I am always busy with different projects." Items that were duplicates, and items for which three out of four judges could not agree on the direction of scoring, were eliminated. Preliminary scales were then assessed in various pilot studies and items revised a number of times.

Description. The current version of the scale contains 28 items (listed in Farmer & Sundberg, 1986), retained on the basis of the following criteria: (1) responses on the item correlated with the total score at least +.20; (2) at least 10% of the sample answered an item in the "bored" direction; (3) a minimal test-retest correlation of +.20 (no time interval specified); and (4) a larger correlation with the total score than with either of two depression scales; depression was chosen
because the variables of boredom and depression overlap but are seen as distinct.

Administration. This scale is easily self-administered and has no time limit; most subjects should be able to finish in less than 15 minutes.

Scoring. The scale is hand-scored; the score represents the number of items endorsed in the keyed direction.

Reliability. Kuder-Richardson 20 reliability for a sample of 233 undergraduates was +.79. Test-retest reliability for 28 males and 34 females, over a 1-week period, was +.83. Thus, this scale appears to be both internally consistent and stable over a 1-week period.
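Kuder-Richardson formula 20, mentioned here, is the internal-consistency coefficient for dichotomously scored items: KR20 = (k/(k − 1))(1 − Σpq/σ²), where k is the number of items, p is the proportion answering each item in the keyed direction, q = 1 − p, and σ² is the variance of total scores. A minimal sketch, assuming each respondent's answers are coded 0/1:

    # Minimal KR-20 sketch for dichotomously scored (0/1) items.
    # `data` is a list of respondents, each a list of 0/1 item scores.

    def kr20(data):
        k = len(data[0])                       # number of items
        n = len(data)                          # number of respondents
        totals = [sum(person) for person in data]
        mean_total = sum(totals) / n
        # Variance of total scores (population form).
        variance = sum((t - mean_total) ** 2 for t in totals) / n
        sum_pq = 0.0
        for item in range(k):
            p = sum(person[item] for person in data) / n
            sum_pq += p * (1 - p)
        return (k / (k - 1)) * (1 - sum_pq / variance)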
cially in the field of factor analysis, used a list
Validity. In a sample of 222 college undergradu- of 60 adjectives and had 1,300 raters describe
ates, scores on the BPS correlated +.67 with two someone they knew well using the list. A factor
boredom self-rating items scored on a 5-point analysis of the ratings indicated five basic fac-
scale, from never to most of the time. Essentially, tors. Allport and Odbert (1936) instead found
this represents the correlation between one T-F that the English language contained some 18,000
scale of 28 items and one 5-point scale of 2 items. descriptive terms related to personality. Stud-
In a second study, BPS scores were correlated ies conducted at the University of Minnesota in
with students’ ratings of whether a lecture and its the 1940s yielded an item pool of 84 categories
topic were boring. Most of the correlations were (Gough, 1991). Meehl, Lykken, Schofield, and
low but significant (in the .20s). BPS scores also Tellegen (1971) in a study of therapists ratings of
correlated significantly (r = +.49) with another their psychiatric patients found 40 factors. Cat-
scale of boredom susceptibility, and a scale of tell considers his 16 dimensions primary traits,
job boredom (r = +.25). At the same time, BPS although there are other primary traits in the
scores correlated substantially with measures of background, as well as secondary traits that seem
depression (.44 and .54), with a measure of hope- just as important. Edwards considered 15 needs
lessness (.41), and a measure of loneliness (.53). to be important, while Jackson using the same
These findings are in line with the observation theory scaled 15 or 22 needs depending upon
that the bored individual experiences varying the test form. Gough on the other hand, prefers
degrees of depression, of hopelessness, and of the idea of an open system that allows the num-
loneliness. ber to be flexible and to be tied to the needs of
applied settings. Many other examples could be
Norms. Formal norms on this scale are not avail- listed here. In one sense we can dismiss the ques-
able in the literature. tion as basically an ivory tower exercise – whether
the continental United States has 48 states, six
Interesting aspects. Note that the development regional areas, 250 major census tracts, or other
of this scale follows the steps we outlined earlier. geopolitical divisions, does not make much dif-
The scale is intended to be internally homoge- ference, and depends upon one’s purposes. But
neous, but a factor analysis has not been carried the search for the number of basic dimensions,
out. The significant correlations with depression, like the search for Bigfoot, goes on.
hopelessness, and loneliness could be seen as a One answer that has found substantial favor
“nuisance” or as a reflection of the real world, and support in the literature is that there are
depending on one’s philosophy of testing. five basic dimensions, collectively known as the
"Big Five." Among the first to point to five basic dimensions were Tupes and Christal (1961) and Norman (1963), although the popularity of this model is mostly due to the work of Costa and McCrae, who have pursued a vigorous program of research to test the validity and utility of this five-factor model (e.g., McCrae & Costa, 1983b; 1987; 1989b; McCrae, Costa & Busch, 1986).

There seems to be general agreement as to the nature of the first three dimensions, but less so with the last two. Table 4.6 gives a description of these dimensions.

Table 4–6. The five-factor model

1. Neuroticism (emotional stability; adjustment) – maladjustment, worrying and insecure, depressed vs. adjustment, calm and secure
2. Extraversion-Introversion (surgency) – sociable and affectionate vs. retiring and reserved
3. Openness to experience (intellect; culture) – imaginative and independent vs. practical and conforming
4. Agreeableness (likability; friendliness) – trusting and helpful, good natured, cooperative vs. suspicious and uncooperative
5. Conscientiousness (dependability; conformity) – well organized and careful vs. disorganized and careless

A number of researchers have reported results consonant with a five-factor model that attest to its theoretical "robustness," degree of generalizability, and cross-cultural applicability (e.g., Barrick & Mount, 1991; Borgatta, 1964; Digman, 1989; 1990; Digman & Inouye, 1986; Digman & Takemoto-Chock, 1981; Goldberg, 1990; Ostendorf, 1990 [cited by Wiggins & Pincus, 1992]; Watson, 1989); but some studies do not support the validity of the five-factor model (e.g., H. Livneh & C. Livneh, 1989).

Note that the five-factor model is a descriptive model. The five dimensions need not occur in any particular order, so that no structure is implied. It is a model rather than a theory, and to that extent it is limited. In fact, McCrae and Costa (1989b) indicate that the five-factor model is not to be considered a replacement for other personality systems, but as a framework for interpreting them. Similarly, they write that measuring the big five factors should be only the first step in undertaking personality assessment. In line with their model, Costa and McCrae (1980; 1985) have presented an inventory to measure these five basic dimensions, and we now turn to this inventory as a final example.

The NEO Personality Inventory-Revised (NEO-PI-R)

Introduction. As the name indicates, this inventory originally was designed to measure three personality dimensions: neuroticism, extraversion, and openness to experience (Costa & McCrae, 1980). Eventually two additional scales, agreeableness and conscientiousness, were added to bring the inventory into line with the Big-Five model (Costa & McCrae, 1985). Finally, in 1990 the current revised edition was published (Costa & McCrae, 1992).

Development. The original NEO inventory, published in 1978, was made up of 144 items developed through factor analysis to fit a three-dimensional model of personality. The test was developed primarily by the rational approach, with the use of factor analysis and related techniques to maximize the internal structure of the scales. Despite the use of such techniques, the emphasis of the authors has been on convergent and discriminant validity coefficients, that is, external criteria rather than internal homogeneity.

The measures of agreeableness and conscientiousness were developed by first creating two 24-item scales, based on a rational approach. Then the scales were factor analyzed, along with the NEO inventory. This resulted in 10 items to measure the two dimensions, although it is not clear whether there were 10 items per dimension or 10 total (McCrae & Costa, 1987). A revised test was then constructed that included the 10 items, plus an additional 50 items intended to measure agreeableness and conscientiousness. An item analysis yielded two 18-item scales to measure the
two dimensions, but inexplicably the two final scales consisted of 10 items to measure agreeableness and 14 items to measure conscientiousness. If the above seems confusing to you, you're in good company! In the current version of the NEO-PI-R each of the five domain scales is made up of six "facets" or subscales, with each facet made up of eight items, so the inventory is composed of a total of 240 items. The keyed response is balanced to control for acquiescence. There are then five major scales, called domain scales, and 30 subscales, called facet scales.
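That structure is easy to express as nested bookkeeping – 5 domains × 6 facets × 8 items – and the sketch below does so schematically. Anxiety and Depression are real Neuroticism facets (they are discussed later in this section); everything else about the sketch, including the 0–4 response coding and the scoring helpers, is illustrative only:

    # Schematic of the NEO-PI-R scoring structure described above:
    # 5 domain scales x 6 facet scales x 8 items = 240 items. Facet raw
    # scores sum eight item responses (assumed here to be coded 0-4);
    # domain raw scores sum six facet scores. Only two real facet names
    # (Anxiety, Depression) are mentioned in this chapter.

    DOMAINS = ["Neuroticism", "Extraversion", "Openness to experience",
               "Agreeableness", "Conscientiousness"]
    FACETS_PER_DOMAIN = 6
    ITEMS_PER_FACET = 8

    assert len(DOMAINS) * FACETS_PER_DOMAIN * ITEMS_PER_FACET == 240

    def facet_score(item_responses):
        """Sum the eight item responses making up one facet."""
        assert len(item_responses) == ITEMS_PER_FACET
        return sum(item_responses)

    def domain_score(facet_scores):
        """Sum the six facet raw scores belonging to one domain."""
        assert len(facet_scores) == FACETS_PER_DOMAIN
        return sum(facet_scores)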
Description. There are two versions of the NEO-PI-R. Form S is the self-report form, with items answered on a 5-point Likert scale from "strongly disagree" to "strongly agree." Form R is a companion instrument for observer ratings, with items written in the third person, for use by spouse, peer, or expert raters (McCrae, 1982). An abbreviated version of the NEO-PI-R is also available, consisting of 60 items and yielding scores for the five domains only (Costa & McCrae, 1989). Like most commercially published personality inventories, the NEO-PI-R uses a reusable test booklet and a separate answer sheet that may be machine or hand scored. The NEO-PI-R is intended for use throughout the adult age range (see Costa & McCrae, 1992, for a discussion of the applicability of the NEO-PI-R to clinical clients).

Administration. As with all other personality inventories, the NEO-PI-R is easy to administer, has no time limit, and can be administered to one person or to many. It can be computer administered, scored, and interpreted, hand scored or machine scored, and professional scoring and interpretation services are available from the publisher.

Scoring. Because of the subscales, hand scoring can be tedious. Raw scores are first calculated for all 30 facet scales and 5 domain scales. These scores are then plotted on profile sheets that are separately normed for men and women. Plotting converts the raw scores into T scores. However, the T scores are then used to calculate domain factor scores. Each factor score involves adding (or subtracting) some 30 components (the facet scores), a horrendous procedure if done by hand. In fact, the manual (Costa & McCrae, 1992) indicates that domain scores give a good approximation of the factor scores, and so it is not worth calculating factor scores by hand for individual cases.

Reliability. Internal consistency and 6-month test-retest reliability coefficients for the first three (NEO) scales are reported to be from +.85 to +.93 (McCrae & Costa, 1987). The test manual (Costa & McCrae, 1992) reports both alpha coefficients and test-retest reliability coefficients, and these seem quite satisfactory. Caruso (2000) reported a meta-analysis of 51 studies dealing with the reliability of the NEO personality scales and found that reliability was dependent on the specific NEO dimension; specifically, Agreeableness scores were the weakest, particularly in clinical samples, for male-only samples, and with test-retest reliability.

Validity. Much of the research using the NEO-PI and leading to the development of the NEO-PI-R is based on two major longitudinal studies of large samples, one of over 2,000 white male veterans, and the other based on a variable-number sample of volunteers participating in a study of aging. Both the test manual and the literature are replete with studies that in one way or another address the validity of the NEO-PI and the NEO-PI-R, including content, criterion, and construct validity. Because we are considering 35 scales, it is impossible to meaningfully summarize such results, but in general the results support the validity of the NEO-PI-R, especially its domain scales.

Norms. The test manual (Costa & McCrae, 1992) gives a table of means and SDs for men and women separately, based on samples of 500 men and 500 women. There is a similar table for college-aged individuals, based on a sample of 148 men and 241 women aged 17 through 20 years. Tables are also available to change raw scores into percentiles.

Interesting aspects. The literature seems to confuse the Big-Five model with the NEO-PI. Although all the evidence points to the usefulness of the five-factor model, whether the NEO-PI-R is the best measure of the five factors is at present an open question.
We can once again ask whether the five dimensions, as assessed by the NEO-PI-R, are independent. The test manual gives a table of intercorrelations among the 35 scales that indicates that the five domain scales are not all that independent; for example, scores on the Neuroticism scale correlate −.53 with scores on the Conscientiousness scale, and scores on the Extraversion scale correlate +.40 with scores on the Openness scale. In addition, the facet scales under each domain scale intercorrelate substantially. For example, Anxiety and Depression, which are facets of the Neuroticism scale, correlate +.64 with each other. Although one would expect the components of a scale to intercorrelate significantly, a substantial correlation brings into question whether the components are really different from each other.

Criticisms. Hogan (1989b), in reviewing the NEO-PI, commends it highly because it was developed and validated on adult subjects rather than college students or mentally ill patients, because it represents an attempt to measure the Big-Five dimensions, and because there is good discriminant and convergent validity. Clearly, the NEO-PI has made an impact on the research literature and is beginning to be used in a cross-cultural context (e.g., Yank et al., 1999). Whether it can be useful in understanding the individual client in counseling and therapeutic settings remains to be seen.

SUMMARY

In this chapter, we have looked at a variety of measures of personality. Most have been personality inventories, made up of a number of scales, that are widely used and commercially available. A few are not widely known but still are useful teaching devices, and they illustrate the wide range of instruments available and the variables that have been scaled.

SUGGESTED READINGS

Broughton, R. (1984). A prototype strategy for construction of personality scales. Journal of Personality and Social Psychology, 47, 1334–1346.

In this study, Broughton examines six strategies by which personality scales can be constructed, including a "prototype" strategy not commonly used.

Burisch, M. (1984). Approaches to personality inventory construction. American Psychologist, 39, 214–227.

A very readable article in which the author discusses three major approaches to personality scale construction, which he labels as external, inductive, and deductive. The author argues that although one method does not appear to be better, the deductive approach is recommended.

Jung, C. G. (1910). The association method. American Journal of Psychology, 21, 219–235.

Jung is of course a well-known name, an early student of Freud who became an internationally known psychiatrist. Here he presents the word association method, including its use to solve a minor crime. Although this method is considered a projective technique rather than an objective test, the historical nature of this paper makes it appropriate reading for this chapter.

Kelly, E. J. (1985). The personality of chessplayers. Journal of Personality Assessment, 49, 282–284.

A brief but interesting study of the MBTI responses of chessplayers. As you might predict, chessplayers are more introverted, intuitive, and thinking types than the general population.

McCrae, R. R., & John, O. P. (1992). An introduction to the five-factor model and its applications. Journal of Personality, 60, 175–215.

A very readable article on the Big-Five model, its nature and history.

DISCUSSION QUESTIONS

1. Do you think that most people answer honestly when they take a personality test?
2. Compare and contrast the Cattell 16 PF and the California Psychological Inventory.
3. The EPPS covers 15 needs that are listed in Table 4.2. Are there any other needs important enough that they should be included in this inventory?
4. How might you go about generating some evidence for the validity of the Self-Consciousness Inventory?
5. How can the criterion validity of a personality measure of "ego strength" (or other dimension) be established?
5 Cognition

AIM In this chapter we focus on the assessment of cognitive abilities, primarily intelligence. We take a brief look at various basic issues, some theories, and some representative instruments. We see that the assessment of intelligence is in a state of flux, partly because of and partly parallel to the changes that are taking place in the field of cognitive psychology.

INTRODUCTION

If you thought personality was difficult to define and a topic filled with questions for which there are no agreed-upon answers, then cognition, and more specifically intelligence, is an even more convoluted topic.

Not only is there no agreed-upon definition of intelligence, but the discoveries and findings of cognitive psychology are coming so fast that any snapshot of the field would be outdated even before it is developed. Fortunately for textbook writers, the field of testing is in many ways slow-moving, and practitioners do not readily embrace new instruments, so much of what is covered in this chapter will not be readily outdated.

In the field of intelligence, a multitude of theoretical systems compete with each other, great debate exists about the limits that heredity and environment impose upon intelligence, and there is substantial argument as to whether intelligence is unitary or composed of multiple processes (A. S. Kaufman, 1990; Sternberg, 1985; 1988a; Wolman, 1985). It is somewhat of a paradox that despite all the turbulent arguments and differing viewpoints, the testing of intelligence is currently dominated basically by two tests: the Stanford-Binet and the Wechsler series. Very clearly, however, there is a revolution brewing, and one concrete sign of it is the current shift from a more product orientation to a more process orientation. In the past, prediction of academic success was a major criterion, both in the construction of intelligence tests (for example, by retaining items that correlated significantly with some index of academic achievement such as grades) and in the interpretation of test results, which emphasized the child's IQ as a predictor of subsequent school performance. Currently, the emphasis is more on theory and on cognitive tests that are more closely tied to a theoretical model, both in their development and in their utilization (Das, Naglieri, & Kirby, 1994). This should not be surprising, given our earlier discussion of the current importance of construct validity.

Some basic thoughts. Most individuals think of intelligence as an ability or set of abilities, thus implying that intelligence is composed of stable characteristics, very much like the idea of traits that we discussed in defining personality. Most likely, these abilities would include the ability to reason, to solve problems, to cope with new situations, to learn, to remember and apply what one has learned, and perhaps the ability to solve new challenges quickly.
Probably most people would also agree that intelligence, or at least intelligent behavior, can be observed and perhaps assessed or measured. Some psychologists would likely add that intelligence refers to the behavior rather than to the person – otherwise we would be forced to agree with circular statements such as, "Johnny is good at solving problems because he is intelligent," rather than the more circumscribed observation that "Johnny is solving this problem in an intelligent manner." Perhaps as a basic starting point we can consider intelligence tests as measures of achievement, of what a person has learned over his or her lifetime within a specific culture; this is in contradistinction to the more typical test of achievement that assesses what the individual has learned in a specific time frame – a semester course in introductory algebra, or basic math learned in primary grades.

Some basic questions. One basic question concerns the nature of intelligence. To what extent is intelligence genetically encoded? Are geniuses born that way? Or can intelligence be increased or decreased through educational opportunities, good parental models, nutrition, and so on? Or are there complex interactions between the nature and the nurture sides of this question, so that intellectual behavior is a reflection of the two aspects? Another basic question concerns the stability over time of cognitive abilities. Do intelligent children grow up to be intelligent adults? Do cognitive abilities decline with age? Another basic issue is how cognitive abilities interact with other aspects of functioning such as motivation, curiosity, initiative, work habits, personality aspects, and other variables. Still another basic question is whether there are gender differences. For example, do females perform better on verbal tasks and males better on quantitative, mathematical tasks? (Maccoby & Jacklin, 1974).

There are indeed lots of intriguing questions that can be asked, and lots of different answers. Way back in 1921, the editors of the Journal of Educational Psychology asked a number of prominent psychologists to address the issue of what is intelligence. More recently, Sternberg and Detterman (1986) put the same request to some 24 experts in the field of intelligence. In both cases, there was a diversity of viewpoints. Some psychologists spoke of intelligence as within the individual, others within the environment, and still others as an interaction between the individual and the environment. Even among those who defined the locus of intelligence as the individual, there were those who were more concerned with biological aspects, others with processes such as cognition and motivation, and still others with observable behavior. Although we have made tremendous leaps since 1921 in our understanding of intelligence and in the technical sophistication with which we measure cognitive functioning, we are still hotly debating some of the very same basic issues. (See Neisser et al., 1996, for an overview of these issues.)

Intelligence: global or multiple. One of the basic questions directly related to the testing of intelligence is whether intelligence is a global capacity, similar to "good health," or whether intelligence can be differentiated into various dimensions that might be called factors or aptitudes, or whether there are a number of different intelligences (Detterman, 1992; H. Gardner, 1983). One type of answer is that intelligence is what we make of it, that our definition may be appropriate for some purposes and not for others. After all, the concept of "good health" is quite appropriate for everyday conversation, but will not do for the internist who must look at the patient in terms both of overlapping systems (respiratory, cardiovascular, etc.) and specific syndromes (asthma, diabetes, etc.).

The early intelligence tests, especially the Binet-Simon, were designed to yield a single, global measure representing the person's general cognitive developmental level. Subsequent tests, such as the Wechsler series, while providing such a global measure, also began to separate cognitive development into verbal and performance areas, and each of these areas was further subdivided. A number of multiple aptitude batteries were developed to assess various components that were either part of intelligence tests but were represented by too few items, or that were relatively neglected, such as mechanical abilities. Finally, a number of tests designed to assess specific cognitive aptitudes were developed.

The progression from global intelligence test to a specification and assessment of individual components was the result of many trends. For
one, the development of factor analysis led to the analysis of intelligence tests and the identification of specific components of such tests. Practical needs in career counseling, the placement of military personnel into various service branches, and the application of tests in industrial settings led to the realization that a global measure was highly limited in usefulness, and that better success could be attained by the use of tests and batteries that were more focused on various specialized dimensions such as form perception, numerical aptitude, manual dexterity, paragraph comprehension, and so on. (For an excellent review of the measurement of intelligence see Carroll, 1982.)

THEORIES OF INTELLIGENCE

The six metaphors. Because intelligence is such a fascinating and, in many ways, central topic for psychology, there are all sorts of theories and speculations about the nature of intelligence, and many disagreements about basic definitional issues. Sternberg (1990) suggests that one way to understand theories of intelligence is to categorize them according to the metaphor they use – that is, the model of intelligence that is used to build the theory. He suggests that there are six such metaphors or models:

1. The geographic metaphor. These theories, those of individuals including Spearman, Thurstone, and Guilford, attempt to provide a map of the mind. They typically attempt to identify the major features of intelligence, namely factors, and try to assess individual differences on these factors. They may also be interested in determining how the mental map changes with age, and how features of the mental map are related to real-life criteria. The focus of these theories is primarily on structure rather than process; like the blueprint of a house, they help us understand how the structure is constructed but not necessarily what takes place in it. Currently, most tests of intelligence are related to, or come from, these geographic theories.

2. The computational metaphor. These theories see the intellect or the mind as a computer. The focus here is on the process, on the "software," and on the commonalities across people and processing rather than on the individual differences. So the focus is on how people go about solving problems, on processing information, rather than on why Johnny does better than Billy. Representative theories are those of Baron (1985), A. L. Brown (1978), and Sternberg (1985). Many of the tests that have evolved from this approach assess very specific processes such as "letter matching." Although some of these tests are components of typical intelligence tests, most are used for research purposes rather than for individual assessment.

3. The biological metaphor. Here intelligence is defined in terms of brain functions. Sternberg (1990) suggests that these theories are based on or supported by three types of data: (1) studies of the localization of specific abilities in specific brain sites, often with patients who have sustained some type of brain injury; (2) electrophysiological studies where the electrical activity of the brain is assessed and related to various intellectual activities such as test scores on an intelligence test; and (3) the measurement of blood flow in the brain during cognitive processing, especially to localize in what part of the brain different processes take place. Representative theories here are those of Das, Kirby, and Jarman (1979) and Luria (1973). This approach is reflected in some tests of intelligence, specifically the Kaufman Assessment Battery for Children, and in neuropsychological batteries designed to assess brain functioning (see Chapter 15).

4. The epistemological metaphor. The word "epistemology" refers to the philosophical study of knowledge, so this model is one that looks primarily at philosophical conceptions for its underpinnings. This model is best represented by the work of the Swiss psychologist Jean Piaget (1952). His theory is that intellectual development proceeds through four discrete periods: (1) a sensorimotor period, from birth to 2 years, whose focus is on direct perception; (2) a preoperational period, ages 2 to 7, where the child begins to represent the world through symbols and images; (3) a concrete operations period, ages 7 to 11, where the child can now perform operations on objects that are physically present and therefore "concrete"; and (4) formal operations, which begins at around age 11, where the child can think abstractly. A number of tests have been developed to assess these intellectual stages, such
as the Concept Assessment Kit – Conservation by Goldschmidt and Bentler (1968).

5. The anthropological metaphor. Intelligence is viewed in the context of culture, and must be considered in relation to the external world. What is adaptive in one culture may not be adaptive in another. Representative theories based on this model are those of J. W. Berry (1974) and Cole (Laboratory of Comparative Human Cognition, 1982). These theories often take a strong negative view of intelligence tests because such tests are typically developed within the context of a particular culture and hence, it is argued, are not generalizable. Those who follow this model tend not to use tests in a traditional sense, but rather develop tasks that are culturally relevant. We return to this issue in Chapter 11.

6. The sociological metaphor. These theories, especially the work of Vygotsky (1978), emphasize the role of socialization processes in the development of intelligence. In one sense, this model of intelligence focuses on the notion that a child observes others in the social environment and internalizes their actions; what happens inside the person (intelligence) first happens between people. This is not mere mimicry but a process that continues over time, and involves continued interactions between child and others. This approach is almost by definition an observational method. For example, Feuerstein (1979) has developed a test called the Learning Potential Assessment Device (LPAD). The LPAD consists of difficult tasks that the child tries to solve. Then the child receives a sequence of hints and the examiner observes how the child profits from these hints.

Other theories. Not all theories can be subsumed under the six metaphors, and it might be argued that Sternberg's schema, although quite useful, is both simplistic and arbitrary; theories are much more complex and are often categorized because they emphasize one feature, but do not necessarily neglect other aspects. Two theories that have particular relevance to psychological testing and perhaps require special mention are those of Guilford (1959a; 1959b) and of H. Gardner (1983).

Guilford has presented a theoretical model called the structure of intellect, sometimes called the three faces of intellect, which sees intellectual functions as composed of processes that are applied to contents and result in products. In this model, there are five types of processes: memory, cognition, divergent thinking, convergent production, and evaluation. These processes are applied to materials that can have one of four types of contents: figural, symbolic, semantic, or behavioral. The result of a process applied to a content is a product, which can involve units, classes, relations, systems, transformations, and implications.

These three facets (processes, contents, and products) can interact to produce 120 separate abilities (5 × 4 × 6), and for many years Guilford and his colleagues sought to develop factor-pure tests for each of these 120 cells. Although the tests themselves have not had that great an impact, the theoretical structure has become embedded in mainstream psychology, particularly educational psychology. We look at one test that emanates directly from Guilford's model (The Structure of Intellect Learning Abilities Test) and at some other tests based on this model when we discuss creativity in Chapter 8.
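The combinatorial structure of Guilford's model is easy to make concrete. The short Python sketch below simply enumerates the 5 × 4 × 6 = 120 hypothesized abilities from the category labels given above:

    from itertools import product

    # The three facets of Guilford's structure-of-intellect model.
    processes = ["memory", "cognition", "divergent thinking",
                 "convergent production", "evaluation"]
    contents = ["figural", "symbolic", "semantic", "behavioral"]
    products = ["units", "classes", "relations", "systems",
                "transformations", "implications"]

    # Each (process, content, product) triple is one hypothesized ability.
    abilities = list(product(processes, contents, products))
    print(len(abilities))   # 120
    print(abilities[0])     # ('memory', 'figural', 'units')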
A second theory is that of H. Gardner (1983), who postulates multiple intelligences, each distinct from the others. Note that this is unlike the approach of factor analysts who view intelligence as composed of multiple abilities. H. Gardner believes that there are seven intelligences that he labels as linguistic, logical-mathematical, spatial (having to do with orientation), musical, bodily-kinesthetic (the ability to use one's body as in athletics or dancing), interpersonal intelligence (understanding others), and intrapersonal intelligence (understanding oneself). For now, this theoretical model has had little influence on psychological testing, although it seems to have the potential for such an impact in the future.

Cognitive approaches. Cognitive psychology has had a tremendous impact on how we perceive brain functioning and how we think in theoretical terms about intelligence, although for now it has had less of an impact on actual assessment. Sternberg's (1985; 1988b) theory of intelligence is a good example of the cognitive approach. Sternberg focuses on information processing and distinguishes three kinds of information-processing components. There are the metacomponents that
are higher-order processes – such as recognizing the existence of a problem, defining what the problem is, and selecting strategies to solve the problem. There are also performance components that are used in various problem-solving strategies – for example, inferring that A and B are similar in some ways but different in others. Finally, there are knowledge-acquisition components that are processes involved in learning new information and storing that information in memory. Sternberg's theory has resulted in an intelligence test – the Sternberg Triarchic Abilities Test – but it is too new to evaluate.

Many of the criticisms of standard intelligence tests such as the Stanford-Binet can be summarized by saying that these efforts focus on the test rather than the theory behind the test. In many ways, these tests were practical measures devised in applied contexts, with a focus on criterion rather than construct validity. Primarily as part of the "revolution" of cognitive psychology, there has been a strong emphasis on different approaches to the study of intelligence, approaches that are more theoretical, that focus more on process (how the child thinks) rather than product (what the right answer is), and that attempt to define intelligence in terms of basic and essential capacities (Horn, 1986; Keating & MacLean, 1987; K. Richardson, 1991).

The basic model for cognitive theories of intelligence has been the computer, which represents a model of how the brain works. It is thus no surprise that the focus has been on information processing and specifically on two major aspects of such processing: the knowledge base and the processing routines that operate on this knowledge base (K. Richardson, 1991).

From a psychometric perspective. The various theories of intelligence can also be classified into three categories (with a great deal of oversimplification): (1) those that see intelligence as a global, unitary ability; (2) those that see intelligence as composed of multiple abilities; and (3) those that attempt to unite the two views into a hierarchical (i.e., composed of several levels) approach.

The first approach is well exemplified by the work of Spearman (1904; 1927), who developed the two-factor theory. This theory hypothesizes that intellectual activities share a common basis, called the general factor, or g. Thus if we administer several tests of intelligence to a group of people, we will find that those individuals who tend to score high on test A also tend to score high on the other tests, and those who score low tend to score low on all tests. If we correlate the data and do a factor analysis, we would obtain high correlations between test scores that would indicate the presence of a single, global factor. But the world isn't perfect, and thus we find variation. Marla may obtain the highest score on test A, but may be number 11 on test B. For Spearman, the variation could be accounted for by specific factors, called s, which were specific to particular tests or intellectual functions. There may also be group factors that occupy an intermediate position between g and s, but clearly what is important is g, which is typically interpreted as general ability to perform mental processing, or a mental complexity factor, or agility of symbol manipulation. A number of tests such as the Raven's Progressive Matrices and the D-48 were designed as measures of g, and are discussed in Chapter 11 because they are considered "culture fair" tests. Spearman was British, and this single-factor approach has remained popular in Great Britain, and to some extent in Europe. It is less accepted in the United States, despite the fact that there is substantial evidence to support this view; for example, A. R. Jensen (1987) analyzed 20 different data sets that contained more than 70 cognitive subtests and found a general factor in each of the correlation matrices. The disagreement, then, seems to be not so much a function of empirical data, but of usefulness – how useful is a particular conceptualization?
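The logic of Spearman's argument can be illustrated with a small simulation (the numbers here are made up, not drawn from any actual data set). Each simulated test score below is a weighted sum of a single general factor g and a test-specific factor s; the resulting correlation matrix is uniformly positive, and a single component dominates it:

    import numpy as np

    rng = np.random.default_rng(0)
    n_people, n_tests = 1000, 6

    g = rng.normal(size=n_people)                    # general ability
    loadings = np.array([0.8, 0.7, 0.75, 0.6, 0.85, 0.7])
    # score = (loading x g) + specific factor s, scaled to unit variance
    scores = np.column_stack([
        lam * g + np.sqrt(1 - lam ** 2) * rng.normal(size=n_people)
        for lam in loadings
    ])

    R = np.corrcoef(scores, rowvar=False)            # all intercorrelations positive
    eigenvalues = np.linalg.eigvalsh(R)[::-1]        # largest first
    print(np.round(R, 2))
    print("share of variance on the first factor:",
          round(eigenvalues[0] / n_tests, 2))        # far above the 1/6 baseline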
The second approach, that of multiple factors, is a popular one in the United States, promulgated quite strongly by early investigators such as T. L. Kelley (1928) and Thurstone (1938). This approach sees intelligence as composed of broad multiple factors, such as a verbal factor, memory, facility with numbers, spatial ability, perceptual speed, and so on. How many such multiple factors are there? This is the same question we asked in the area of personality, and just as in personality, there is no generally agreed-upon number. Thurstone originally proposed 12 primary mental abilities, while more current investigators such as Guilford have proposed as many as 120. In fact, there is no generally agreed naming of such factors, and what one investigator may label as a "perceptual speed" factor, another investigator may label quite differently (Ekstrom, French, & Harman, 1979).
As always, there are "middle of the road" approaches that attempt to incorporate the two opposing views. Several scientists have developed hierarchical theories that generally take a "pyramid" approach (e.g., Gustafsson, 1984; Humphreys, 1962; Vernon, 1960). At the top of the pyramid there is Spearman's g. Below that there are two or three major group factors. Each of these is subdivided into minor group factors, and these may be further subdivided into specific factors. Figure 5.1 illustrates a "generic" hierarchical theory.

FIGURE 5–1. Schematic diagram of hierarchical theories: g at the top; two major factors (x, y) below it; minor factors (a, b, c, d, e) below these; and specific factors (a1, a2, a3, a4) at the bottom.

OTHER ASPECTS

Intelligence and job performance. Correlations between general intelligence and job proficiency are typically in the .20s range (Ghiselli, 1966; 1973). It's been argued, however, that the typical study in this area involves a sample that is preselected and homogeneous. For example, workers at a particular factory who were hired on the basis of an application form and an interview probably survived a probationary period and continue to be hired because they met some minimal criteria of performance. Because each of these conditions places restrictions on the variability of the sample, and because of restrictions on the ratings of job performance, the "true" correlation between general intelligence and job proficiency is substantially higher, particularly as a job requires greater complexity (J. E. Hunter, 1986; J. E. Hunter & R. F. Hunter, 1984). J. E. Hunter (1986) pointed out that in well-executed studies where job proficiency is defined by objective criteria rather than supervisors' ratings, the relationship between intelligence and job performance could correlate as high as the mid .70s. Yet, we should not expect a very high correlation between any type of test and the amazing variety of skills, accomplishments, etc., to be found in different occupational activities (Baird, 1985).
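The statistical point here, that preselection restricts variability and so shrinks the observed correlation, is easy to demonstrate with simulated data. In the sketch below the "true" correlation in the full applicant pool is set at .50; "hiring" only the top fifth on the test cuts the observed correlation roughly in half:

    import numpy as np

    rng = np.random.default_rng(1)
    n, rho = 10_000, 0.5

    # Intelligence and job proficiency correlated .50 among all applicants.
    intelligence = rng.normal(size=n)
    proficiency = rho * intelligence + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    r_full = np.corrcoef(intelligence, proficiency)[0, 1]

    # "Hire" only the top 20% on intelligence, then recompute the correlation
    # within this restricted, more homogeneous group.
    hired = intelligence > np.quantile(intelligence, 0.80)
    r_hired = np.corrcoef(intelligence[hired], proficiency[hired])[0, 1]

    print(round(r_full, 2), round(r_hired, 2))   # about 0.50 vs. roughly 0.2 to 0.3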
Intelligence and academic achievement. Most psychologists would agree that standard intelligence tests are good measures or predictors of academic achievement. The literature confirms that there is a relationship between intelligence test scores and academic achievement of about .50 (Matarazzo, 1972). Such a relationship is somewhat higher in primary grades and somewhat lower in college (Brody, 1985). Keep in mind that in college the grading scale is severely restricted; in theory it is a 5-point scale (A to F), but in practice it may be even more restricted. In addition, college grades are a function of many more nonintellectual variables (such as degree of motivation, good study habits, and outside interests) than is true of primary grades. Intellectual abilities are also more homogeneous among college students than among children in primary grades, and faculty are not particularly reliable in their grading habits.

Academic vs. practical intelligence. Neisser (1976) and others have argued that typical intelligence tests measure academic intelligence, which is different from practical intelligence. Neisser (1976) suggests that the assessment of academic intelligence involves solving tasks that are not particularly interesting, that are presented by others, and that are disconnected from everyday experience. Practical intelligence involves solving tasks as they occur in natural settings, and these tasks are "interesting" because they involve the well-being
of the individual involved. Various other terms are used for practical intelligence, such as social competence or social-behavioral intelligence.

Wagner and Sternberg (1986) indicate that a good argument can be made for the importance of practical intelligence in successful performance in the real world, and they indicate that typical intelligence tests only correlate about .20 with criteria of occupational performance, as we indicated above. They proceeded to assess what they call tacit knowledge, knowledge that is practical yet usually not directly taught, such as how to advance in one's career. To assess such tacit knowledge, they constructed a series of vignettes with a number of alternative responses to be ranked. For example, a vignette might indicate that you are a young assistant manager who aspires to be vice president of the company. The response alternatives list a number of actions you might undertake to reach your career goal, and you are to rank order these alternatives as to importance. What Wagner and Sternberg (1986) found is the same result as in studies of chessplayers, computer programmers, and others – that experts and novices, or in this case seasoned executives vs. beginners, differ from each other primarily in the amount and organization of their knowledge regarding the domain involved, rather than in the underlying cognitive abilities as measured by a traditional test of intelligence.

Still others (e.g., Frederiksen, 1962) have argued that traditional academic tests of intelligence do not capture the complexity and immediacy of real life, and that activities that simulate such real-life endeavors are more useful (see Sternberg, Wagner, Williams, & Horvath, 1995).

Criticisms. Probably more than any other type of test, intelligence tests have generated a great deal of controversy and criticism, often quite acrimonious (e.g., Hardy, Welcher, Mellits, & Kagan, 1976; Hilliard, 1975; Lezak, 1988; Mercer, 1973; R. L. Williams, 1972; 1974). A. S. Kaufman (1979a) suggests that many of the criticisms of intelligence tests are more emotional than empirically defensible, especially criticisms related to supposed racial bias. But intelligence tests are certainly not perfect, and there seem to be a number of valid criticisms. One criticism is that these tests have not changed since the work of Binet at the beginning of the 1900s. They have been revised, and new ones have been introduced, but the basic strategy and structure remain the same, despite the enormous advances that have been made in understanding how the brain functions (Linn, 1986). Of course, one can answer that their longevity is a sign of their success.

Others have argued that intelligence test items really do not measure directly a person's ability to learn or to perform a task rapidly and correctly (e.g., Estes, 1974; Thorndike, 1926). Such items have been incorporated in some tests, and we can argue, as we did in Chapter 2, that the content of a test need not overlap with the predictive criterion, so that a test could empirically predict a person's ability to learn new tasks without necessarily using items that utilize new tasks. Still another criticism is that intelligence tests do not adequately incorporate developmental theories, such as the insights of Piaget, or structural theories such as Guilford's model. Again, some tests do, but the criticism is certainly applicable to most tests of intelligence. Others (e.g., Anastasi, 1983) argue that intelligence, when redefined in accord with what we now know, is a very useful construct.

Finally, it is clear that the terms often associated with intelligence testing, such as "IQ," "gifted," and "mentally defective," are emotionally laden terms and, in the mind of many lay persons, related to genetic connotations. Such terms are being slowly abandoned in favor of more neutral ones.

Intelligent testing. Both Wesman (1968) and A. S. Kaufman (1979a) have argued that intelligence testing should be "intelligent testing" – that is, that testing should focus on the person, not the test, and that the skilled examiner synthesizes the obtained information into a sophisticated totality, with sensitivity to those aspects of the client that must be taken into consideration, such as ethnic and linguistic background. This line of thinking is certainly concordant with our definition of a test as a tool; the more sophisticated and well-trained artisan can use that tool more effectively.

Age scale vs. point scale. Assume that as a homework assignment you were given the task to develop an intelligence test for children. There
are probably two basic ways you might go about this. One way is to devise items that show a developmental progression – for example, items that the typical 5-year-old would know but younger children would not. If you were to find 12 such items, you could simply score each correct answer as worth 1 month of mental age (of course, any number of items would work; they would just be given proportional credit – so with 36 items, each correct answer would be counted as one third of a month). A 5-year-old child, then, might get all the 5-year-old items correct, plus 3 items at the 6-year level, and 1 item at the 7-year level. That child would then have a mental age of 5 years and 4 months. You would have created an age scale, where items are placed in age-equivalent categories, and scoring is based upon the assignment of some type of age score. This is the approach that was taken by Binet and by Terman in developing the Binet tests.

Another alternative is, using the same items, to simply score items as correct or not, and to calculate the average score for 5-year-olds, 6-year-olds, and so on. Presumably, the mean for each year would increase, and you could make sense of a child's raw score by comparing that score to the age-appropriate group. Now you would have created a point scale. This was the approach taken by Wechsler in developing his tests.

The concept of mental age. Just as we classify people according to the number of years they have lived – "he is 18 years old" – it would seem to make sense to describe people according to the level of mental maturity they have achieved. Indeed such a concept has been proposed by many. One of the hallmarks of Binet's tests was that such a concept of mental age was incorporated into the test. Thus, a child was considered retarded if his or her performance on the Binet-Simon was that of a younger child. Terman further concretized the concept in the Stanford-Binet by placing test items at various age levels on the basis of the performance of normal children. Thus, with any child taking the Stanford-Binet, a mental age could be calculated simply by adding up the credits for test items passed. This mental age divided by chronological age, and multiplied by 100 to eliminate decimals, gave a ratio called the intelligence quotient or IQ.
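Both the age-scale bookkeeping and the ratio IQ reduce to a few lines of arithmetic, sketched below in Python using the hypothetical homework example above (12 items per age level, each worth 1 month of mental age):

    # Mental age on an age scale: full credit through the basal age, plus
    # 1 month for each item passed at higher age levels (12 items per year).
    def mental_age_months(basal_age_years, passes_above_basal):
        return basal_age_years * 12 + sum(passes_above_basal.values())

    def ratio_iq(ma_months, ca_months):
        # IQ = (mental age / chronological age) x 100
        return 100 * ma_months / ca_months

    # The child above: all 5-year items, 3 six-year items, 1 seven-year item.
    ma = mental_age_months(5, {6: 3, 7: 1})
    print(ma)                               # 64 months = 5 years, 4 months
    print(round(ratio_iq(ma, 5 * 12), 1))   # 106.7 for a 5-year-old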
Ever since its creation, the concept of IQ has been attacked as ambiguous, misleading, and limited. It was pointed out that two children with the same mental age but with differing chronological ages were qualitatively different in their intellectual functioning, and similarly two children with the same IQ but with differing chronological and mental ages might be quite different. In addition, mental age, unlike chronological age, is not a continuous variable beyond a certain age; a 42-year-old person does not necessarily have greater mental abilities than a 41-year-old person.

Wechsler (1939) proposed the concept of deviation IQ as an alternative to the ratio IQ. The deviation IQ consists of transforming a person's raw score on the test to a measuring scale where 100 is the mean and 16 is the standard deviation. Let's assume for example that we have tested a sample of 218 nine-year-olds with an intelligence test. Their mean turns out to be 48 and the SD equals 3. We now change these raw scores to z scores and then to scores that have a mean of 100 and an SD of 16. We can tabulate these changes so that for any new 9-year-old who is tested, we can simply look up in a table (usually found in the manual of the intelligence test we are using) what the raw score is equivalent to. In our fictitious sample, we tested children all of the same age. We could also have tested a sample that was somewhat more heterogeneous in age, for example children aged 5 to 11, and used these data as our norms.
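The tabled conversion is nothing more than a z-score transformation. A minimal sketch using the fictitious norms just described (mean 48, SD 3, with the deviation IQ scale set at mean 100 and SD 16):

    def deviation_iq(raw, norm_mean, norm_sd, mean=100, sd=16):
        """Convert a raw score to a deviation IQ through a z score."""
        z = (raw - norm_mean) / norm_sd
        return mean + sd * z

    # Fictitious norms for 9-year-olds: mean 48, SD 3.
    for raw in (42, 45, 48, 51, 54):
        print(raw, round(deviation_iq(raw, norm_mean=48, norm_sd=3)))
    # 42 -> 68, 45 -> 84, 48 -> 100, 51 -> 116, 54 -> 132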
Item selection. You are interested in doing a study to answer the question whether males or females are more intelligent. You plan to select a sample of opposite-sex fraternal twins, where one of the twins is male and the other female, because such a sample would presumably control such extraneous and/or confounding aspects as socioeconomic level, child rearing, type of food eaten, exposure to television, and so on. You plan to administer a popular test of intelligence to these twins, a test that has been shown to be reliable and valid. Unfortunately for your plans, your study does not make sense. Why? Basically because when a test is constructed, items that show a differential response rate for different genders, or different ethnic groups, or other important variables, are eliminated from consideration. If, for example, a particular vocabulary word would be identified correctly for its meaning by more white children than minority children, that word would not likely be included in the final test.

The need for revisions. At first glance, it may seem highly desirable to have tests that are frequently revised, so that the items are current, and so that they are revised or abandoned on the basis of accumulated data obtained "in the field." On the other hand, it takes time not only to develop a test, but to master the intricacies of administration, scoring, and interpretation, so that too frequent revisions may result in unhappy consumers. Each revision, particularly if it is substantial, essentially results in a new instrument for which the accumulated data may no longer be pertinent.

Understanding vs. prediction. Recall that tests can be used for two major purposes. If I am interested in predicting whether Susan will do well in college, I can use the test score as a predictor. Whether there is a relationship or not between test score and behavior, such as performance in class, is a matter of empirical validity. The focus here is on the test score, on the product of the performance. If, however, I am interested in understanding how and why Susan goes about solving problems, then the matter becomes more complicated. Knowing that Susan's raw score is 81 or that her IQ is 123 does not answer my needs. Here the focus would be more on the process, on how Susan goes about solving problems, rather than just on the score.

Correlation vs. assignment. For most tests, the degree of reliability and validity is expressed as a correlation coefficient. Tests, however, are often used with an individual client, and as we discussed in Chapter 3, correlation coefficients represent "nomothetic" data rather than idiographic. Suppose we wanted to use test results to assign children to discrete categories, such as eligible for gifted placement vs. not eligible; how can we translate such coefficients into more directly meaningful information? Sicoly (1992) provides one answer by presenting tables that allow the user to compute the sensitivity, efficiency, and specificity of a test given the test's validity, the selection ratio, and the base rate. As we discussed in Chapter 3, sensitivity represents the proportion of low performers (i.e., positives) on the criterion who are identified accurately by a particular test – that is, the ratio of true positives to true positives plus false negatives. Efficiency represents the proportion of those identified by the test as positive who are true positives – that is, the ratio of true positives to true positives plus false positives. Finally, specificity represents the proportion of high performers (i.e., negatives) who are identified correctly by the test – that is, the ratio of true negatives to true negatives plus false positives.
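All three indices come directly from a 2 × 2 table of test decisions against criterion status. A minimal sketch with invented frequencies:

    def screening_indices(tp, fp, fn, tn):
        """Indices from a 2 x 2 table of test decisions vs. criterion status."""
        return {
            "sensitivity": tp / (tp + fn),  # true positives / all criterion positives
            "efficiency":  tp / (tp + fp),  # true positives / all test positives
            "specificity": tn / (tn + fp),  # true negatives / all criterion negatives
        }

    # Hypothetical screening of 100 children.
    print(screening_indices(tp=18, fp=7, fn=6, tn=69))
    # sensitivity = 0.75, efficiency = 0.72, specificity = about 0.91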
We have barely scratched the surface on some of the issues involved in the psychological testing of intelligence, but because our focus is on psychological testing, we need to look at a number of different tests, and leave these basic issues for others to explore and discuss.

THE BINET TESTS

In 1904, the Minister of Public Instruction for the Paris schools asked psychologist Alfred Binet to study ways in which mentally retarded children could be identified in the classroom. Binet was at this time a well-known psychologist and had been working on the nature and assessment of intelligence for some time. Binet and a collaborator, Theodore Simon, addressed this challenge by developing a 30-item test, which became known as the 1905 Binet-Simon Scale (Binet & Simon, 1905).

The 1905 Binet-Simon Scale. This scale was the first practical intelligence test. The items on this scale included imitating gestures and following simple commands, telling how two objects are alike, defining common words, drawing designs from memory, and repeating spoken digits (T. H. Wolf, 1973). The 30 items were arranged from easy to difficult, as determined by the performance of 50 normal children aged 3 to 11 and some mentally retarded children. The items were quite heterogeneous but reflected Binet's view that certain faculties, such as comprehension
and reasoning, were fundamental aspects of intelligence.

This scale was a very preliminary instrument, more like a structured interview, for which no total score was obtained. The scale was simple to administer and was intended for use by the classroom teacher. The aim of the scale was essentially to identify children who were retarded and to classify these children at one of three levels of retardation, which were called "moron, imbecile, and idiot."

The 1908 Binet-Simon Scale. The Binet-Simon was revised, and the 1908 scale contained more items, grouped into age levels based on the performance of about 300 normal children. For example, items that were passed by most 4-year-olds were placed at the fourth-year level, items passed by most 5-year-olds were placed at the fifth-year level, and so on from ages 3 to 13. A child's score could then be expressed as a mental level or mental age, a concept that helped popularize intelligence testing.

The 1911 Binet-Simon Scale. A second revision of the Binet-Simon scale appeared in 1911, the same year that Binet died. This revision had only very minor changes, including the extension to age level 15 and five ungraded adult tests (for specific details on the Binet-Simon scales see Sattler, 1982).

The Binet-Simon scales generated great interest among many American psychologists who translated and/or adopted the scales. One of these psychologists was Terman at Stanford University, who first published a revision of the Binet-Simon in 1912 (Terman & Childs, 1912) but subsequently revised it so extensively that essentially it was a new test, and so the Stanford revision of the Binet-Simon became the Stanford-Binet.

The 1916 Stanford-Binet. The first Stanford-Binet was published in 1916 (Terman, 1916). This scale was standardized on an American sample of about 1,000 children and 400 adults. Terman provided detailed instructions on how to administer the test and how to score the items, and the term "IQ" was incorporated in the test. It was clear that the test was designed for professionals and that one needed some background in psychology and psychometrics to administer it validly.

The 1937 Stanford-Binet. This revision consisted of two parallel forms, forms L and M, and a complete restandardization on a new sample of more than 3,000 children, including about 100 children at each half-year interval from ages 1 to 5, 200 children at each age from 6 to 14, and 100 children at each age from 15 to 18. The test manual gave specific scoring examples (Terman & Merrill, 1937). The sample was not truly representative, however, and the test was criticized for this. Nevertheless, the test became very popular and in some ways represented the science of psychology – quantifying and measuring a major aspect of life.

The 1960 Stanford-Binet. This revision combined the best items from the two 1937 forms into one single form and recalculated the difficulty level of each item based on a sample of almost 4,500 subjects who had taken the 1937 scale between the years 1950 and 1954. A major innovation of this revision was the use of deviation IQ tables in place of the ratio IQ. Test items on this form were grouped into 20 age levels, with age levels ranging from 2 through "superior adult." Representative test items consisted of correctly defining words, pointing out body parts on a paper doll, counting numbers of blocks in various piles, repeating digits, and finding the shortest path in a maze.

The 1972 Stanford-Binet. The 1972 revision made only some very minor changes on two items, but presented new norms based on approximately 2,100 subjects. To obtain a nationally representative sample, the 2,100 children were actually part of a larger stratified sample of 200,000 children who had been tested to standardize a group test called the Cognitive Abilities Test. The 2,100 children were selected on the basis of their scores on the Cognitive Abilities Test, to be representative of the larger sample.

It is interesting to note that these norms showed an increase in performance on the Stanford-Binet, especially at the preschool ages, where there was an average increase of about 10 points. These increases apparently reflected cultural changes, including the increasing level of education of parents and the impact of television, especially "Sesame Street" and other programs designed to stimulate intellectual development (Thorndike, 1977). This form also was criticized with respect to the unrepresentativeness of the standardization sample (Waddell, 1980).
Major areas:  Crystallized abilities                Fluid and analytical abilities   Short-term memory

Scales:       Verbal            Quantitative        Abstract/Visual
              Reasoning         Reasoning           Reasoning

Subtests:     Vocabulary        Quantitative        Pattern analysis                 Bead memory
              Comprehension     Number series       Copying                          Memory for sentences
              Absurdities       Equation building   Matrices                         Memory for digits
              Verbal relations                      Paper folding                    Memory for objects

FIGURE 5–2. Hierarchical model of the Stanford-Binet IV.

1977). This form also was criticized with respect which are nonverbal abilities involved in spatial
to the unrepresentativeness of the standardiza- thinking; these concepts were originally proposed
tion sample (Waddell, 1980). by R. B. Cattell (1963). Crystallized abilities are
further divided into verbal reasoning and quan-
The 1986 Stanford-Binet. This was the fourth titative reasoning, while fluid-analytic abilities
revision of the Stanford-Binet and its most exten- translate into abstract/visual reasoning. Finally,
sive to date (Hagen, Delaney, & Hopkins, 1987; there is a short-term memory area. Thus, the
Thorndike, Hagen, & Sattler, 1986a; 1986b). So 15 subtests of the Stanford-Binet IV are then
many changes were made on this version, referred assigned to these theoretical categories as indi-
to in the literature as the Stanford-Binet IV, that cated in Figure 5.2.
it might as well be considered a new test. The As with most other commercially published
earlier forms were age scales while this revision intelligence tests, the Stanford-Binet consists of a
was a point scale. The earlier forms were pre- package of products that include the actual test
dominantly verbal in focus, while the 1986 form materials (for example, a set of blocks to form
contained spatial, quantitative, and short-term into various patterns; a card with printed vocab-
memory items as well. This revision was designed ulary words), a record form that allows the exam-
for use from the age of 2 to adult. The standard- iner to record and/or keep track of the responses
ization sample consisted of more than 5,000 per- of the subject, often with summarized informa-
sons, from ages 2 to 23, stratified according to tion as to time limits, scoring procedures, etc.,
the 1980 U.S. Census on such demographic vari- a manual that gives detailed information on the
ables as gender, ethnicity, and area of residence. administration, scoring, and interpretive proce-
Despite such efforts, the standardization sample dures, and a technical manual that gives technical
had an overrepresentation of blacks, an under- details such as reliability and validity coefficients,
representation of whites, and an overrepresenta- etc. Typically, other materials may be available,
tion of children from higher socioeconomic-level either from the original publisher, other publish-
homes. ers, or in the literature. These materials might
The theory that subsumes the 1986 Stanford- include more detailed guides for specific types of
Binet is a hierarchical theory, with g at the top of clients, such as the learning disabled, or the gifted,
the hierarchy, defined as including information- computational aids to estimate standard errors,
processing abilities, planning and organizing and computerized procedures to score and inter-
abilities, and reasoning and adaptation skills. pret the test results.
Incorporated into this theory are also the con-
cepts of crystallized abilities, which are basi- Description. The 1986 scale is actually composed
cally academic skills, and fluid-analytic abilities, of 15 subtests. Within each of the subtests, the
items are arranged from easy to difficult. Prior to the 1986 revision, Stanford-Binet items included actual toys and objects as part of the materials to be administered. With the 1986 edition, only pictures of such objects were used. As indicated in Figure 5.2, the 15 subtests are subsumed under four content areas: (1) verbal reasoning, (2) abstract/visual reasoning, (3) quantitative reasoning, and (4) short-term memory.

Thus, the Stanford-Binet IV yields a composite score, four area scores, and 15 individual subtest scores.

Some of the subtests range in difficulty from ages 2 to 18, for example the Vocabulary and the Comprehension subtests. Other subtests only cover the older years, from age 10 upwards (for example, the Equation Building and the Paper Folding and Cutting subtests).

Administration. As with other individual intelligence tests, the administration of the 1986 Stanford-Binet requires a highly trained examiner. In fact, the Stanford-Binet can be considered more of a clinical interview than a simple test. There are complex interactions that occur between examiner and subject, and the astute examiner can obtain a wealth of information not only about how the subject uses his or her intellectual capacities, but also about how well organized the child is, how persistent and confident, what work methods are used, what problem-solving approaches are used, the reaction to success and failure, how frustration is handled, how the child copes with authority figures, and so on.

The 15 subtests are administered in a predetermined and mixed sequence, not as they are listed in Figure 5.2, but with Vocabulary administered first. A number of the subtests have practice items so that the subject has a chance to practice and to understand what is being requested. The Stanford-Binet is an adaptive test; that is, not all items are administered to all subjects, and which items and subtests are administered is a function of the subject's chronological age and performance. Where to begin on the Vocabulary subtest is a function of the subject's age. For all the other subtests, the entry level is determined from a chart for which the subject's age and score on the Vocabulary subtest are needed. In administering each of the tests, the examiner must determine the basal level, defined as passing four consecutive items. If a 10-year-old passes only 2 or 3 of the 4 items, then testing is continued downward to easier items until four consecutive items are passed. Presumably, items below this basal level are easier and, therefore, would be passed. The examiner must also determine the ceiling level, reached when three out of four consecutive items are missed. Testing on the particular subtest is then discontinued. Note that many tests of intelligence use a basal and a ceiling level, but these are not necessarily defined in the same manner as in the 1986 Stanford-Binet. Thus, only between 8 and 13 subtests are administered to any one individual subject.

Part of the reason for having such an administrative procedure is so that the test administration begins at an optimal level, not so difficult as to create discouragement or so easy as to result in boredom. Another reason is time; we want to maximize the amount of information obtained but minimize the amount of time required, as well as reduce fatigue in the subject and/or the examiner.
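The basal and ceiling rules amount to a simple stopping algorithm. The following schematic Python sketch shows only the bookkeeping; the pass/fail record is hypothetical, and an actual administration works outward from an age-based entry item rather than from the first item of the list:

    def find_basal(results):
        """Index of the first run of four consecutive passes (1 = pass, 0 = fail)."""
        for i in range(len(results) - 3):
            if all(results[i:i + 4]):
                return i
        return None

    def find_ceiling(results):
        """Stop after the first window of four items with three or more failures."""
        for i in range(len(results) - 3):
            if sum(results[i:i + 4]) <= 1:
                return i + 4
        return len(results)

    # Hypothetical item-by-item outcomes, ordered from easy to difficult.
    outcomes = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
    print(find_basal(outcomes), find_ceiling(outcomes))   # 0 12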
ered more of a clinical interview than a simple
test. There are complex interactions that occur Scoring. Each item is scored as correct or incor-
between examiner and subject, and the astute rect. Scores on the items for the various subtests
examiner can obtain a wealth of information are then summed to yield raw scores for each of
not only about how the subject uses his or her the subtests. These raw scores are then changed to
intellectual capacities, but how well organized standard age scores or SAS, which are normalized
the child is, how persistent and confident, what standard scores with mean of 50 and SD of 8,
work methods are used, what problem-solving by using the subject’s age to locate the appro-
approaches are used, the reaction to success and priate normative tables in the test manual. In
failure, how frustration is handled, how the child addition, SAS can be obtained for each of the
copes with authority figures, and so on. four major areas and as a total for the entire test.
The 15 subtests are administered in a prede- These summary scores, however, are set so the
termined and mixed sequence, not as they are mean is 100 and the SD is 16, in keeping with the
listed in Figure 5.2, but with Vocabulary admin- earlier editions of the Stanford-Binet and with
istered first. A number of the subtests have prac- most other intelligence tests. What in earlier edi-
tice items so that the subject has a chance to prac- tions was called a deviation IQ score is now called
tice and to understand what is being requested. a test composite score. Additional guidelines for
The Stanford-Binet is an adaptive test, that is, administration, scoring, and interpretation of the
not all items are administered to all subjects, but Stanford-Binet can be found in various publica-
which items and subtests are administered are a tions (such as Delaney & Hopkins, 1987).
function of the subject’s chronological age and
performance. Where to begin on the Vocabulary Reliability. As you might imagine, there is con-
subtest is a function of the subject’s age. For all the siderable information about the reliability of the
other subtests, the entry level is determined from Stanford-Binet, most of which supports the con-
a chart for which the subject’s age and score on the clusion that the Stanford-Binet is quite reliable.
Vocabulary subtest are needed. In administering Most of the reliability information is of the inter-
each of the tests, the examiner must determine nal consistency variety, typically using the Kuder-
the basal level, defined as passing four consec- Richardson formula. At the subtest level, the 15
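Because the basal and ceiling rules are essentially algorithmic, they can be stated compactly in code. The following Python sketch is illustrative only: the item list, the entry index, and the give_item function are hypothetical stand-ins, it assumes the entry point leaves at least four items above it, and the actual rules in the test manual are more detailed than this.

    def administer_subtest(items, start, give_item):
        """Sketch of adaptive administration with a basal and a ceiling level.

        items:     subtest items ordered from easy to difficult
        start:     entry index chosen from age (and the Vocabulary score)
        give_item: returns True for a pass, False for a failure
        """
        results = {}

        def ask(i):                      # administer an item only once
            if i not in results:
                results[i] = give_item(items[i])
            return results[i]

        # Basal level: move downward until four consecutive items are passed.
        basal = start
        while True:
            if all(ask(i) for i in range(basal, basal + 4)) or basal == 0:
                break
            basal -= 1
        for i in range(basal):           # items below the basal are credited
            results[i] = True

        # Ceiling level: continue upward until three of four consecutive
        # items are missed; the subtest is then discontinued.
        i = max(results) + 1
        while i < len(items):
            ask(i)
            window = [results[j] for j in range(max(0, i - 3), i + 1)]
            if window.count(False) >= 3:
                break
            i += 1

        return sum(results.values())     # raw score: number of items passed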
Scoring. Each item is scored as correct or incorrect. Scores on the items for the various subtests are then summed to yield raw scores for each of the subtests. These raw scores are then changed to standard age scores, or SAS, which are normalized standard scores with a mean of 50 and SD of 8, by using the subject's age to locate the appropriate normative tables in the test manual. In addition, SAS can be obtained for each of the four major areas and as a total for the entire test. These summary scores, however, are set so that the mean is 100 and the SD is 16, in keeping with the earlier editions of the Stanford-Binet and with most other intelligence tests. What in earlier editions was called a deviation IQ score is now called a test composite score. Additional guidelines for administration, scoring, and interpretation of the Stanford-Binet can be found in various publications (such as Delaney & Hopkins, 1987).

Reliability. As you might imagine, there is considerable information about the reliability of the Stanford-Binet, most of which supports the conclusion that the Stanford-Binet is quite reliable. Most of the reliability information is of the internal-consistency variety, typically using the Kuder-Richardson formula. At the subtest level, the 15 subtests are reliable, with typical coefficients in the high .80s and low .90s. The one exception is the Memory for Objects subtest, which is short and has typical reliability coefficients from the high .60s to the high .70s. The reliabilities of the four area scores and of the total score are quite high, ranging from .80 to .99.

Some test-retest reliability information is also available in the test manual. For example, for two groups of children, 5-year-olds and 8-year-olds, retested with an interval of 2 to 8 months, reliability coefficients of .91 and .90 were obtained for the total score.

Because the Stanford-Binet requires a skilled examiner, a natural question is that of interrater reliability: would two examiners score a test protocol identically? No such reliability is reported in the test manual, and few studies are available in the literature (Mason, 1992).

With the Stanford-Binet IV, scatter of subtest scores may in fact reflect unreliability of test scores or other aspects such as examiner error or situational variables, all of which lower reliability. Some investigators (Rosenthal & Kamphaus, 1988; Spruill, 1988) have computed tables of confidence intervals that allow the test user to correctly identify when subtest scores for a subject are indeed different from each other and may therefore reflect differential patterning of abilities.
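The logic behind such confidence-interval tables can be sketched with the usual formulas for the standard error of measurement and the standard error of a difference. The reliabilities of .90 and .85 below are invented for illustration; they are not the tabled values of Rosenthal and Kamphaus (1988) or Spruill (1988).

    import math

    def se_of_difference(sd, rel_a, rel_b):
        """Standard error of the difference between two scores on the
        same scale, from the usual formulas:
            SEM    = SD * sqrt(1 - reliability)
            SEdiff = sqrt(SEM_a**2 + SEM_b**2)
        """
        sem_a = sd * math.sqrt(1 - rel_a)
        sem_b = sd * math.sqrt(1 - rel_b)
        return math.sqrt(sem_a ** 2 + sem_b ** 2)

    # Stanford-Binet subtests are reported as SAS values with SD = 8.
    # With illustrative reliabilities of .90 and .85:
    se_diff = se_of_difference(8, .90, .85)
    cutoff = 1.96 * se_diff   # 95% confidence

    # Two subtest scores would need to differ by more than the cutoff
    # before the difference is treated as more than measurement error.
    print(round(se_diff, 2), round(cutoff, 2))   # 4.0  7.84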
Validity. The assessment of validity for a test like the Stanford-Binet is a complex undertaking, perhaps best understood in terms of construct validity. The development of the 1986 Stanford-Binet was based not just on the prior editions but on a series of complicated analyses of a pool of potential items that were field tested and revised a number of times (Thorndike, Hagen, & Sattler, 1986b). There were three major sources of validity information investigated by the test authors: (1) factor analysis, (2) correlations with other intelligence tests, and (3) performance of "deviant" groups.

The results of various factor analyses indicate support for the notion that the correlations of the 15 subtests can be accounted for by a general factor. In addition, the results are also somewhat consonant with the idea that there is not only one general factor but also at least three specific area factors. Because different subtests are administered at different ages, the factor structure of the test varies according to age. For example, the test manual indicates that there are two factors at the preschool level but four factors at the adolescent/adult level. At the same time, somewhat different factor structures have been reported by different investigators (e.g., Keith et al., 1988; R. B. Kline, 1989; Sattler, 1988). Other investigators have looked at the factor structure of the Stanford-Binet in a variety of samples, from elementary school children to the gifted (e.g., Boyle, 1989; Gridley, 1991; T. Z. Keith et al., 1988; R. B. Kline, 1989; McCallum, 1990; McCallum, Karnes, & Crowell, 1988; Ownby & Carmin, 1988). At the same time, it can be argued that the results of the factor analyses do not fully support the theoretical model that gave birth to the Stanford-Binet IV. All subtests do load significantly on g. Some of the subtests do load significantly on the appropriate major areas, but there are exceptions. For example, the Matrices subtest, which falls under the Abstract/Visual reasoning area, actually loads more highly on the Quantitative reasoning area. Whether these exceptions are strong enough to suggest that the theoretical model is incorrect is debatable (Delaney & Hopkins, 1987).

A second source of validity information is the correlations obtained between scores on the Stanford-Binet and scores on other intelligence tests, primarily the Wechsler tests and earlier versions of the Stanford-Binet. The obtained correlation coefficients are too numerous to report here but, in general, show substantial correlations between subtests of the same type. Total Stanford-Binet scores and similar scores on other tests correlate in the .80 to .91 range. Other investigators have compared the Stanford-Binet with the WISC-R in gifted children (e.g., Phelps, 1989; Robinson & Nagle, 1992) and in learning-disabled children (e.g., T. L. Brown, 1991; Phelps & Bell, 1988), with the WAIS-R (e.g., Carvajal, 1987a; Spruill, 1991), with the WPPSI (Carvajal, 1991), with the K-ABC (e.g., Hayden, Furlong, & Linnemeyer, 1988; Hendershott et al., 1990; Knight, Baker, & Minder, 1990; Krohn & Lamp, 1989; Lamp & Krohn, 1990), and with other tests (e.g., Atkinson, 1992; Carvajal, 1987c; 1988; Karr, Carvajal, & Palmer, 1992). The Wechsler tests and the K-ABC are discussed below.

Finally, a number of studies with gifted, learning-disabled, and mentally retarded children show results consonant with group membership. A substantial number of studies are now available in the literature, with most indicating substantial validity of the Stanford-Binet with a variety of subjects (e.g., A. C. Greene, Sapp, & Chissom, 1990; Knight, Baker, & Minder, 1990; Krohn & Lamp, 1989; D. K. Smith, St. Martin, & Lyon, 1989).

Special forms. Assessment of special populations, such as the hearing impaired or those with learning disabilities, requires different approaches, including the modification of standard techniques. Glaub and Kamphaus (1991) constructed a short form of the 1986 Stanford-Binet by having school psychologists select those subtests requiring the least amount of verbal response by the examinee and little verbal expression by the examiner. Of the 15 subtests, 5 met the selection criteria. These five subtests, as a totality, are estimated to have a reliability of .95 and correlate .91 with the summary score for the total test.

Jacobson et al. (1978) developed a Spanish version of the Stanford-Binet for use with Cuban children. Abbreviated forms are also available (Carvajal, 1987b; Volker et al., 1999).

Criticisms. The early Binet tests were criticized on a number of grounds, including inadequate standardization, a heavy emphasis on verbal skills, items too heavily reflective of school experience, narrow sampling of the intellectual functions assessed, inappropriate difficulty of items (with too-easy items at the lower levels and too-difficult items at the upper levels), and other more technical limitations (Frank, 1983). The 1986 revision has addressed most of the limitations of the earlier versions, but as indicated, the standardization sample is still not fully representative, with an overrepresentation of children from professional-managerial homes and college-educated parents. The results of the factor analyses are also not as uniform as one might hope, and there is a bit of additional confusion generated by the test authors, who do not agree as to whether area scores or factor scores should be used (Sattler, 1988; Thorndike, Hagen, & Sattler, 1986b).

THE WECHSLER TESTS

David Wechsler, a psychologist long associated with Bellevue Psychiatric Hospital in New York City, developed a series of three intelligence tests – the Wechsler Adult Intelligence Scale (WAIS), the Wechsler Intelligence Scale for Children (WISC), and the Wechsler Preschool and Primary Scale of Intelligence (WPPSI). These tests have become widely accepted and utilized by clinicians and other professionals, and, particularly at the adult level, the WAIS has no competition. The Wechsler tests are primarily clinical tools designed to assess the individual "totally," with the focus more on the process than on the resulting scores.

The WAIS

Introduction. The WAIS had its beginnings in 1939 as the Wechsler-Bellevue Intelligence Scale. Wechsler (1939) pointed out that the then-available tests of intelligence, primarily the Stanford-Binet, had been designed to assess the intelligence of children and in some cases had been adapted for use with adults simply by adding more difficult items. He argued that many intelligence tests gave undue emphasis to verbal tasks, that speed of response was often a major component, and that the standardization samples typically included few adults. To overcome these limitations, Wechsler developed the Wechsler-Bellevue, with many of the items adapted from the Binet-Simon tests, from the Army Alpha, which had been used in the military during World War I, and from other tests then in vogue (G. T. Frank, 1983; A. S. Kaufman, 1990).

In 1955, the Wechsler-Bellevue was replaced by the WAIS, which was then revised in 1981 as the WAIS-R and again in 1997 as the WAIS-III. The items for the WAIS scales were selected from various other tests, from clinical experience, and from many pilot projects. They were thus chosen on the basis of their empirical validity, although the initial selection was guided by Wechsler's theory of the nature of intelligence (Wechsler, 1958; 1975).
The WAIS-R revision was an attempt to modernize the content by, for example, including new Information subtest items that refer to famous blacks and to women, to reduce ambiguity, to eliminate "controversial" questions, and to facilitate administration and scoring by appropriate changes in the Manual. In addition, a new standardization sample was collected.

Description. The WAIS-R is composed of 11 subtests that are divided into two areas – the Verbal Scale, with 6 subtests, and the Performance Scale, with 5 subtests. Table 5.1 lists the subtests and gives a brief description of each.

Table 5–1. The WAIS-R subtests (subtests marked T are timed)

Verbal scale:
Information – A measure of range of knowledge. Composed of questions of general information that adults in our culture presumably know (e.g., in which direction does the sun set?).
Digit span – Involves the repetition of 3 to 9 digits forward, and 2 to 8 backward. Measures immediate memory and the disruptive effects of anxiety.
Vocabulary – Defining words of increasing difficulty. Measures vocabulary.
Arithmetic (T) – Elementary school problems to be solved in one's head. Presumably measures the ability to concentrate.
Comprehension – Items that attempt to measure common sense and practical judgment.
Similarities – Requires the examinee to point out how two things are alike. Measures abstract thinking.

Performance scale:
Picture completion – A series of drawings, each with a detail that is missing. Measures alertness to details.
Picture arrangement (T) – Sets of cartoon-like panels that need to be placed in an appropriate sequence to make a story. Measures the ability to plan.
Block design (T) – A set of designs is to be reproduced with colored blocks. Measures nonverbal reasoning.
Object assembly (T) – Puzzles representing familiar objects, like a hand, are to be put together. Measures the ability to perceive part-whole relationships.
Digit symbol (T) – A code-substitution task in which 9 symbols are paired with 9 digits; the examinee is given a sequence of numbers and needs to fill in the appropriate symbols within a 90-second time limit. Measures visual-motor functioning.

Administration. In the 1955 WAIS, the six verbal subtests were presented first, followed by the five performance subtests. In the WAIS-R, they are administered by alternating a verbal and a performance subtest in a prescribed order, beginning with Information. As indicated in Table 5.1, five of the subtests are timed, so that the scores on these reflect both correctness and speed.

Scoring. The WAIS-R is an individual test of intelligence that requires a trained examiner to administer it, to score it, and to interpret the results. The test manual gives detailed scoring criteria that vary according to the subtest. For example, for the Information subtest each item is scored as either correct or incorrect. But for the Comprehension subtest and the Similarities subtest, some answers are worth 2 points, some 1 point, and some 0. For the Object Assembly items, scoring is a function of how many of the puzzle pieces are correctly placed together, plus a time bonus; scores for the hand puzzle, for example, can vary from 0 to 11 points. A number of books are available for the professional that give further guidance on administration, scoring, and interpretation (e.g., Groth-Marnat, 1984; Zimmerman, Woo-Sam, & Glasser, 1973).

Raw scores on each subtest are changed into standard scores with a mean of 10 and SD of 3 by using the appropriate table in the test manual. This table is based upon the performance of 500 individuals, all between the ages of 20 and 34. The standard scores are then summed across the six subtests that make up the Verbal Scale to derive a Verbal score; a similar procedure is followed for the five subtests of the Performance Scale to yield a Performance score, and the two are added together to yield a Full Scale score. Using the tables in the manual, these three scores can be changed to deviation IQs, each measured on a scale with a mean of 100 and SD of 15.
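The conversion sequence just described is easy to express in outline. In the Python sketch below, the lookup tables are hypothetical stand-ins for the conversion tables of the actual manual:

    # Minimal sketch of the WAIS-R scoring pipeline. The dictionaries
    # "tables" and "iq_table" stand in for the manual's lookup tables.

    VERBAL = ["information", "digit_span", "vocabulary",
              "arithmetic", "comprehension", "similarities"]
    PERFORMANCE = ["picture_completion", "picture_arrangement",
                   "block_design", "object_assembly", "digit_symbol"]

    def scale(subtest, raw, tables):
        # raw score -> standard score (mean 10, SD 3) via table lookup
        return tables[subtest][raw]

    def wais_r_scores(raw_scores, tables, iq_table):
        verbal = sum(scale(s, raw_scores[s], tables) for s in VERBAL)
        performance = sum(scale(s, raw_scores[s], tables)
                          for s in PERFORMANCE)
        full = verbal + performance
        # each sum of standard scores is converted to a deviation IQ
        # (mean 100, SD 15) through a further table lookup
        return (iq_table["verbal"][verbal],
                iq_table["performance"][performance],
                iq_table["full"][full])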
Microcomputer scoring systems that can carry out the score conversions and provide brief reports based on the subject's test performance are now available.

The Full Scale IQs obtained on any of the Wechsler scales are divided into seven nominal categories, listed in Table 5.2.

Table 5–2. Classification of Wechsler IQs

IQ            Classification
130 & above   Very superior
120–129       Superior
110–119       High average or bright normal
90–109        Average
80–89         Low average or dull normal
70–79         Borderline
69 & below    Mentally retarded or mentally defective

Reliability. Reliability coefficients for the WAIS are presented separately for each of nine age groups. Corrected split-half reliabilities for the Full Scale IQ scores range from .96 to .98; for the Verbal IQ scores they range from .95 to .97; and for the Performance IQ scores, from .88 to .94 (Wechsler, 1981). Similar coefficients are reported for the WAIS-R: for example, both the Full Scale IQ and the Verbal IQ have coefficients of .97, and the Performance IQ of .93. For the individual subtests, the corrected split-half reliabilities are lower, but the great majority of the coefficients are above .70. Split-half reliability is not appropriate for the Digit Symbol subtest, because this is a speeded test, nor for the Digit Span subtest, because this is administered as two separate subtests (digits forward and digits backward). For these two subtests, alternate-form reliabilities are reported, based on comparisons of the WAIS-R with the WAIS or with the WISC-R (note that the WAIS does not have alternate forms). The WAIS-R manual also includes standard errors of measurement; for the Full Scale IQ and Verbal IQ these are below 3 points, while for the Performance IQ it is 4.1.

Test-retest reliability coefficients, over an interval of 2 to 7 weeks, hover around .90 for the three summary scores (Verbal, Performance, and Full Scale), and are in the .80s and .90s for most of the subtests. Subtests like Picture Arrangement and Object Assembly seem, however, to be marginal, with coefficients in the .60s. Interestingly, average Full Scale IQ seems to increase about 6 to 7 points upon retest, probably reflecting a practice effect.
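The word "corrected" in these split-half figures ordinarily refers to the Spearman-Brown step-up, which estimates the reliability of the full-length test from the correlation between its two halves. A minimal sketch, with invented item scores and assuming odd-even halves:

    # Split-half reliability with the Spearman-Brown correction.
    # item_scores holds one row of item scores per examinee.

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5

    def corrected_split_half(item_scores):
        odd = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
        even = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
        r_half = pearson(odd, even)
        return 2 * r_half / (1 + r_half)   # Spearman-Brown step-up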
Validity. Wechsler has argued that his scales have content and construct validity – that is, the scales themselves define intelligence. Thus, the Wechsler manuals that accompany the respective tests typically do not have sections labeled "validity," and the generation of such data, especially criterion validity, is left up to other investigators.

The presence of content validity is argued from the fact that the items and subtests included in the WAIS-R are a reflection of Wechsler's theory of intelligence and his aim of assessing intelligence as a global capacity. Items were included both on empirical grounds, in that they correlated well with various criteria of intelligence, and on logical grounds, in that they were judged to be appropriate by experienced clinicians.

There are, however, a number of studies that address the criterion validity of the WAIS and WAIS-R. These have typically shown high correlations between the two tests, and high correlations with the Stanford-Binet and other intelligence tests. Other studies have demonstrated a relationship of WAIS and WAIS-R scores to various indices of academic success, with typical correlation coefficients in the .40s.

Norms. The normative sample consisted of almost 1,900 individuals chosen so as to be representative along a number of dimensions, such as race and geographical region of residence, according to U.S. Census data. These individuals were distributed equally over nine age levels, from years 16–17 to years 70–74, and were basically "normal" adults, exclusive of persons with severe psychiatric and/or physical conditions.

Stability over time. Aside from a reliability point of view, we can ask how stable intelligence is over a period of time. A number of studies have used the WAIS with different groups of subjects, such as college students, geriatric patients, and police applicants, and retested them after varying periods of time ranging from a few months to 13 years; these have found typical correlation coefficients in the .80s and .90s for the shorter time periods, and in the .70s for longer time periods (e.g., H. S. Brown & May, 1979; Catron & Thompson, 1979; Kangas & Bradway, 1971).

The Deterioration Quotient. A somewhat unique aspect of the WAIS tests is the observation that as individuals age, their performance on some of the WAIS subtests, such as Vocabulary and Information, is not significantly impaired, while on other subtests, such as Block Design and Digit Symbol, there can be serious impairment. This led to the identification of "hold" subtests (no impairment) and "don't hold" subtests, and a ratio termed the Deterioration Quotient, although the research findings do not fully support the validity of such an index (e.g., J. E. Blum, Fosshage, & Jarvik, 1972; R. D. Norman & Daley, 1959).

Wechsler argued that the intellectual deterioration present as a function of aging could also be reflected in other forms of psychopathology, and that the Deterioration Quotient would be useful as a measure of such deterioration. In fact, the research literature does not seem to support this point (e.g., Bersoff, 1970; Dorken & Greenbloom, 1953).

Pattern analysis. The use of the Wechsler scales has generated a large amount of information on what is called pattern analysis – the meaning of any differences between subtest scaled scores or between Verbal and Performance IQs. For example, we normally would expect a person's Verbal IQ and Performance IQ to be fairly similar. What does it mean if there is a substantial discrepancy between the two scores, above and beyond the variation that might be expected due to the lack of perfect reliability? A number of hypotheses have been proposed, but the experimental results are by no means in agreement. For example, schizophrenia is said to involve both impaired judgment and poor concentration, so schizophrenic patients should score lower on the Comprehension and Arithmetic subtests than on other subtests. Whether there is support for this and other hypothesized patterns is highly debatable (G. H. Frank, 1970). In addition, the same pattern of performance may be related to several diagnostic conditions. For example, a Performance IQ significantly higher than a Vocabulary IQ might be indicative of left-hemisphere cerebral impairment (Goldstein & Shelly, 1975), underachievement (Guertin, Ladd, Frank, et al., 1966), or delinquency (Haynes & Bensch, 1981).

Many indices of such pattern or profile analysis have been proposed. Wechsler (1941) suggested that differences larger than two scaled points from the person's subtest mean were significant and might reflect some abnormality; McFie (1975) suggested three points, and other investigators have suggested more statistically sophisticated indices (e.g., Burgess, 1991; Silverstein, 1984).

Part of the difficulty of pattern analysis is that the difference between subtests obtained by one individual may reflect diagnostic condition, less-than-perfect reliability, or variation due to other causes that we lump together as "error," and we cannot disentangle the three aspects, particularly when the reliabilities are on the low side, as is the case with subtests such as Object Assembly and Picture Arrangement.
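The simplest of these indices is easily computed. The sketch below flags any subtest that deviates from the examinee's own subtest mean by more than a chosen criterion (two scaled-score points in Wechsler's 1941 suggestion, three in McFie's); the profile shown is invented for illustration.

    def deviant_subtests(scaled_scores, criterion=2.0):
        """scaled_scores: dict mapping subtest name -> scaled score."""
        mean = sum(scaled_scores.values()) / len(scaled_scores)
        return {name: round(score - mean, 1)
                for name, score in scaled_scores.items()
                if abs(score - mean) > criterion}

    profile = {"information": 12, "comprehension": 7, "arithmetic": 8,
               "similarities": 13, "vocabulary": 12, "digit_span": 10}
    print(deviant_subtests(profile))
    # -> comprehension, arithmetic, and similarities exceed the
    #    two-point criterion relative to this examinee's mean of 10.3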
Factor structure. Whether the Wechsler tests measure g, two factors, or three factors is an issue that, at present, remains unresolved, despite energetic attempts at providing a definitive answer (e.g., Fraboni & Saltstone, 1992; Leckliter, Matarazzo, & Silverstein, 1986). Verbal and Performance IQs typically correlate about .80. Scores on the verbal subtests generally correlate higher with the Verbal IQ than with the Performance IQ, while scores on the performance subtests generally correlate higher with the Performance IQ than with the Verbal IQ. (However, the difference in correlation coefficients is typically quite small, of the order of .10.) Factor-analytic studies do seem to suggest that there is one general factor in the WAIS, typically called "general reasoning." Many studies, however, also find two to three other important factors, typically named "verbal comprehension," "performance," and "memory" (J. Cohen, 1957). A substantial number of studies have factor analyzed the 1955 WAIS, and the results have been far from unanimous; these have been summarized by Matarazzo (1972).

The WAIS-R also has been factor analyzed, and here too the results are equivocal. Naglieri and A. S. Kaufman (1983) performed six factor analyses using different methods on the 1,880 protocols from the standardization sample, adults aged 16 to 74 years. The various methods yielded anywhere from one to four factors depending on the age group. The authors concluded that the most defensible interpretation was two factors (Verbal and Performance), followed closely by three factors (Verbal, Performance, and Freedom from Distractibility).

Abbreviated scales. Basically, there are two ways to develop a short form of a test that consists of many sections or subtests, such as the WAIS or the MMPI. One way is to reduce the number of subtests administered; instead of administering all 11 subtests of the WAIS-R, for example, we could administer a subset that correlates substantially with the total test. This is what has been done with the WAIS. Another way is to administer all subtests but reduce the number of items within the subtests. This second method, the item-reduction method, has several advantages in that a wider sample of test behavior is obtained and the scores for each subtest can be calculated. Some empirical evidence also suggests that item-reduction short forms provide a more comparable estimate of the full-battery total score than do subtest-reduction short forms (Nagle & Bell, 1995).

Short forms have two primary purposes: (1) to reduce the amount of testing time, and (2) to provide valid information. C. E. Watkins (1986) reviewed the literature on the Wechsler short forms (at all three levels of adult, children, and preschool) and concluded that none of the abbreviated forms could be considered valid as IQ measures, but that they were useful as screening instruments.

For any test, then, abbreviated forms are typically developed by administering the original test and then correlating various subtests or subsets of items with the total score on the full form; thus the criterion in determining the validity of a short form is its correlation with the Full Scale IQ. Abbreviated forms of the Wechsler tests have been proposed that either eliminate items within subtests or simply administer a combination of five or fewer subtests. Under the first approach, a number of investigators have developed short forms of the Wechsler tests by selecting subsets of items, such as every third item. These short forms correlate in the .80 to .90 range with Full Scale IQ (e.g., Finch, Thornton, & Montgomery, 1974; J. D. King & Smith, 1972; Preston, 1978; Yudin, 1966).

Others have looked at a wide variety of subtest combinations. For example, a commonly used abbreviated form of the WAIS is composed of the Arithmetic, Vocabulary, Block Design, and Picture Arrangement subtests. These abbreviated scales are particularly attractive when there is need for a rapid screening procedure, and their attractiveness is increased by the finding that such abbreviated scales can correlate as high as .95 to .97 with the Full Scale IQs (Silverstein, 1968; 1970). McNemar (1950) examined the relationship of every possible combination of subtests and found that they correlated in the .80 to .90 range with Full Scale IQ. Kaufman, Ishikuma, and Kaufman-Packer (1991) developed several extremely brief short forms of the WAIS-R that seem to be both reliable and valid. Still others have focused on the Vocabulary subtest because, for many, vocabulary epitomizes intelligence. Vocabulary subtest scores, whether in regular length or in abbreviated form, typically correlate in the .90s with Full Scale IQ (e.g., Armstrong, 1955; J. F. Jastak & J. R. Jastak, 1964; Patterson, 1946).

Obviously, the use of an abbreviated scale short-circuits what may well be the most valuable aspect of the WAIS, namely an experimental-clinical situation in which the behavior of the subject can be observed under standard conditions. It is generally agreed that such short forms should be administered only as screening tests, rather than as an assessment or diagnostic procedure, or for research purposes where a rough estimate of intelligence is needed.

Group administration. Although the Wechsler tests are individually administered, a number of investigators have attempted to develop group forms, typically by selecting specific subtests and altering the administration procedures so that a group of individuals can be tested simultaneously (e.g., Elwood, 1969; Mishra, 1971). Results from these administrations typically correlate in the .80 to .90 range with standard administration, although again such group administrations negate the rich observational data that can be gathered from a one-on-one administration.
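The validation strategy described above (administer the full test, then correlate candidate subsets with the Full Scale IQ) can be sketched as follows. The function and data layout are hypothetical, and pearson() stands for the ordinary product-moment correlation sketched earlier:

    # Exhaustive search over subtest combinations, in the spirit of
    # McNemar (1950). "records" holds each person's subtest scaled
    # scores; "full_scale" holds their Full Scale IQs, in the same order.

    from itertools import combinations

    def best_short_forms(records, full_scale, subtests, size, pearson):
        scored = []
        for combo in combinations(subtests, size):
            composite = [sum(r[s] for s in combo) for r in records]
            scored.append((pearson(composite, full_scale), combo))
        # highest-correlating combinations first
        return sorted(scored, reverse=True)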
Examiner error. Most test manuals do not discuss examiner error, perhaps based on the assumption that because clear administration and scoring guidelines are given, such error does not exist. The evidence, however, is quite to the contrary. Slate and Hunnicutt (1988) reviewed the literature on examiner error as related to the Wechsler scales and proposed several explanatory reasons for the presence of such error: (1) inadequate training and poor instructional procedures; (2) ambiguity in test manuals, in terms of a lack of clear scoring guidelines and of specific instructions as to when to further question ambiguous responses; (3) carelessness on the part of the examiner, ranging from incorrect calculation of raw scores to incorrect test administration; (4) errors due to the relationship between examiner and examinee – for example, the finding that "cold" examiners obtain lower IQs from their examinees than do "warmer" examiners; and (5) job concerns of the examiner – for example, greater errors on the part of examiners who are overloaded with clients or are dissatisfied with their jobs.

Criticisms. Despite the frequent use of the Wechsler tests, there are many criticisms in the literature. Some are identical to those of the Stanford-Binet. Some are mild and easily rebutted; others are much more severe. G. Frank (1983), for example, in a thoughtful and thorough review of the Wechsler tests, concludes that they are like a "dinosaur," too cumbersome and not in line with current conceptualizations of psychometrics and of intelligence; he suggests, therefore, that it is time for them to become "extinct"!

In spite of such severe judgments, the WAIS-R continues to be used extensively in both clinical and research practice, and many of its virtues are extolled. For example, contrary to popular opinion, one of the general findings for the Wechsler tests is that they do not have a systematic bias against minority members (e.g., A. R. Jensen, 1976; A. S. Kaufman & Hollenbeck, 1974; D. J. Reschly & Sabers, 1979; Silverstein, 1973).

The WISC

The original Wechsler-Bellevue was developed as an adult test. Once this was done, it was extended downward to assess children and eventually became the Wechsler Intelligence Scale for Children, or WISC (Seashore, Wesman, & Doppelt, 1950). Many of the items for the WISC were taken directly from the Wechsler-Bellevue, and others were simply easier items modeled on the adult items. (You might recall that the Stanford-Binet had been criticized because some of its items at the adult level were more difficult versions of children's items!) A revised version of the WISC, called the WISC-R, was published in 1974. These two scales are quite comparable, with 72% of the WISC items retained for the WISC-R. The WISC-R was again revised in 1991, when it became the WISC-III. Chattin (1989) conducted a national survey of 267 school psychologists to determine which of four intelligence tests (the K-ABC, the Stanford-Binet IV, the WISC-R, and the McCarthy Scales of Children's Abilities) was evaluated most highly. The results indicated that the WISC-R was judged to be the most valid measure of intelligence and the test that provided the most useful diagnostic information.

Description. The WISC-R consists of 12 subtests, 2 of which are supplementary; the supplementary subtests should be administered, but they may also be used as substitutes if one of the other subtests cannot be administered. As with the WAIS, the subtests are divided into Verbal and Performance and are very similar to those found in the WAIS. Table 5.3 gives a listing of these subtests.

Administration. As with all the Wechsler tests, administration, scoring, and interpretation require a trained examiner. Most graduate students in fields such as clinical psychology take at least one course on such tests and have the opportunity to sharpen their testing skills in externship and internship experiences. The WISC-R is particularly challenging because the client is a child, and good rapport is especially crucial.

The instructions in the test manual for administration and scoring are quite detailed and must be carefully followed. The starting point for some of the WISC-R subtests varies as a function of the child's age. For most of the subtests, testing is discontinued after a specified number of failures; for example, testing is discontinued on the Information subtest if the child misses five consecutive items.
Table 5–3. WISC subtests

Verbal scale: Information, Similarities, Arithmetic, Vocabulary, Comprehension, Digit span*
Performance scale: Picture completion, Picture arrangement, Block design, Object assembly, Coding (like the Digit Symbol of the WAIS), Mazes* (mazes of increasing difficulty)

Note: Descriptions are identical to those of the WAIS. *Digit span and Mazes are supplementary tests.

Scoring. Scoring the WISC-R is quite similar to scoring the WAIS. Detailed guidelines are presented in the test manual as to what is considered a correct response and how points are to be distributed if the item is not simply scored as correct or incorrect. Raw scores are then changed into normalized standard scores with a mean of 10 and SD of 3, as compared with the child's own age group. These subtest scores are then added and converted to a deviation IQ with a mean of 100 and SD of 15. Three total scores are thus obtained: a Verbal IQ, a Performance IQ, and a Full Scale IQ. As with both the Stanford-Binet and the WAIS, a number of sources are available to provide additional guidance for the user of the WISC-R (e.g., Groth-Marnat, 1984; A. S. Kaufman, 1979a; Sattler, 1982; Truch, 1989). A. S. Kaufman (1979a), in particular, gives some interesting and illustrative case reports.

Computer programs to score the WISC-R and provide a psychological report on the client are available but apparently differ in their usefulness (Das, 1989; Sibley, 1989).

Reliability. Both split-half (odd-even) and test-retest (1-month interval) reliabilities are reported in the test manual. For the total scores, they are all in the .90s, suggesting substantial reliability of both the internal-consistency and the stability-over-time types. As one might expect, the reliabilities of the individual subtests are not as high, but they typically range in the .70s and .80s.

The test manual also gives information on the standard error of measurement and the standard error of the difference between means (which we discussed in Chapter 3). The SE of measurement for the Full Scale IQ is about 3 points. This means that if we tested Annette and she obtained a Full Scale IQ of 118, we would be quite confident that her "true" IQ is somewhere between 112 and 124 (1.96 times the SE). This state of affairs is portrayed in Figure 5.3.
FIGURE 5–3. Annette's theoretical IQ distribution. [A normal curve centered on her obtained score of 118, with markers at 112, 115, 118, 121, and 124. The SE (or SD) is 3 points; therefore we are about 95% confident that her true IQ would not deviate from the obtained score by more than 1.96 standard deviations, or about 6 points.]
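The arithmetic behind this band is worth making explicit; the following minimal sketch reproduces the values in Figure 5.3:

    # 95% confidence band around an obtained score, given its SE of
    # measurement. 1.96 is the usual two-tailed 95% normal value.

    def confidence_band(obtained, sem, z=1.96):
        margin = z * sem
        return obtained - margin, obtained + margin

    print(confidence_band(118, 3))
    # -> (112.12, 123.88), i.e., roughly 112 to 124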
Validity. Studies comparing WISC scores with various measures of academic achievement, such as grades, teachers' evaluations, and so on, typically report correlation coefficients in the .50s and .60s, with Verbal Scale IQs correlating higher than Performance Scale IQs with such criteria. Correlations of WISC scores with scores on the Stanford-Binet are in the .60s and .70s, and sometimes higher, again with the Verbal Scale IQ correlating more highly than the Performance Scale IQ, and with the Vocabulary subtest yielding the highest pattern of correlations of all subtests (Littell, 1960).

Studies comparing the WISC-R to the WISC show substantial correlations between the two, typically in the .80s (e.g., K. Berry & Sherrets, 1975; C. R. Brooks, 1977; Swerdlik, 1977; P. J. Thomas, 1980). In addition, scores on the WISC-R have been correlated with scores on a substantial number of other tests, with the results supporting its concurrent and construct validity (e.g., C. R. Brooks, 1977; Hale, 1978; C. L. Nicholson, 1977; Wikoff, 1979).

Fewer studies have looked at the predictive validity of the WISC-R. Those that have find that WISC-R scores, particularly the Verbal IQ, correlate significantly, often in the .40 to .60 range, with school achievement, whether measured by grades, teachers' ratings, or achievement test scores (e.g., Dean, 1979; Hartlage & Steele, 1977; D. J. Reschly & J. E. Reschly, 1979).

Norms. The standardization sample for the WISC-R consisted of 2,200 children, with 100 boys and 100 girls at each age level, from 6½ through 16½ years. These children came from 32 states and represented a stratified sample on the basis of U.S. Census data.

Pattern analysis. As with the WAIS, a number of investigators have looked at pattern analysis on the WISC, with pretty much the same outcome (Lewandoski & Saccuzzo, 1975; Saccuzzo & Lewandoski, 1976). Here the concept of scatter is relevant, in which the child performs in a manner somewhat inconsistent with normal expectation – for example, missing some easy items on a subtest but answering correctly on more difficult items, or showing high scores on some of the verbal subtests but low scores on others. Whether such scatter is diagnostic of specific conditions such as emotional disturbance or learning disability remains debatable (e.g., Bloom & Raskin, 1980; Dean, 1978; Hale & Landino, 1981; Ollendick, 1979; Thompson, 1980; Zingale & Smith, 1978).

One measure of scatter is the profile variability index, which is the variance of subtest scores around an examinee's mean subtest score (Plake, Reynolds, & Gutkin, 1981). A study of this index in a sample of children who had been administered the WISC-R, the Stanford-Binet IV, and the K-ABC (see next section) indicated that such an index had essentially no validity (Kline, Snyder, Guilmette, et al., 1993).

Factor structure. Lawson and Inglis (1985) applied principal components analysis (a type of factor analysis) to the correlation matrices given in the WISC-R manual. They obtained two factors. The first was a positive factor, on which all items loaded (i.e., correlated) positively, and was interpreted as g, or general intelligence. The second factor was a bipolar factor, with negative loadings on the verbal subtests and positive loadings on the nonverbal subtests – a result highly similar to Wechsler's original distinction between verbal and performance subtests. Indeed, many studies of the factor structure of the WISC-R have consistently reported a Verbal Comprehension factor and a Perceptual Organization factor that parallel quite well the division of subtests into verbal and performance (the one subtest that does not conform very well is the Coding subtest). This factor pattern has been obtained with a wide variety of samples that vary in ethnicity, age, clinical diagnosis, and academic status. A third factor is also often obtained, usually interpreted as a "freedom from distractibility" dimension. Its nature and presence, however, seem to show some fluctuation from study to study, so that perhaps this third factor assesses different abilities for different groups (A. S. Kaufman, 1979a).
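The kind of analysis Lawson and Inglis carried out can be illustrated in a few lines. The 4 x 4 correlation matrix below is invented (the real analyses used the full correlation tables in the WISC-R manual), and the first-component loadings play the role of loadings on g:

    import numpy as np

    R = np.array([[1.00, .60, .55, .35],
                  [ .60, 1.00, .58, .30],
                  [ .55, .58, 1.00, .33],
                  [ .35, .30, .33, 1.00]])

    eigenvalues, eigenvectors = np.linalg.eigh(R)       # ascending order
    first = eigenvectors[:, -1] * np.sqrt(eigenvalues[-1])
    if first.sum() < 0:                                 # fix arbitrary sign
        first = -first

    # With all loadings positive and substantial, the first component is
    # the usual candidate for g; a bipolar second component would
    # separate verbal from nonverbal subtests.
    print(np.round(first, 2))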
Correlations between Verbal Scale IQs and Performance Scale IQs are in the high .60s and low .70s; they indicate substantial overlap between the two areas, but also enough independence to justify the use of the two summary scores.

Other factor analyses of the WISC have suggested a factor structure similar to that of the WAIS, including a general factor and at least three other substantial factors of verbal comprehension, perceptual-spatial functioning, and memory or freedom from distractibility (Gutkin & Reynolds, 1980; 1981; Littell, 1960; Van Hagan & Kaufman, 1975; Zimmerman & Woo-Sam, 1972). These three factors have also been obtained in studies of learning-disabled children and mentally retarded children (e.g., Cummins & Das, 1980; Naglieri, 1981). The third factor is a rather small factor, and a number of alternative labels for it have been proposed.

Bannatyne (1971) proposed a recategorization of the WISC subtests into four major categories:

verbal-conceptual ability (Vocabulary + Comprehension + Similarities subtests)
acquired information (Information + Arithmetic + Vocabulary subtests)
visual-spatial ability (Block Design + Object Assembly + Picture Completion subtests)
sequencing (Coding + Digit Span + Arithmetic subtests)

The argument for such a recategorization is that the analysis is more meaningful with learning-disabled children and that the results are more easily interpretable to teachers and parents. The focus would not be so much on the IQ but on the measurement of abilities.

Abbreviated scales. As with the WAIS, a number of efforts have been made to identify combinations of WISC subtests that correlate highly with the Full Scale IQ of the entire WISC (Silverstein, 1968; 1970), or to administer every other item or every third item while including all subtests (Silverstein, 1967). One such subtest combination consists of the Vocabulary and Block Design subtests; scores here correlate in the .80s with the Full Scale IQ of the entire WISC-R (Ryan, 1981). Typically, quite high correlations (in the .80s and .90s) are obtained between abbreviated forms and the full test, and these abbreviated forms can be useful as screening devices or for research purposes where only a summary IQ number is needed. Several investigators have focused on the use of WISC-R short forms to screen for and identify intellectually gifted students (e.g., Elman, Blixt, & Sawacki, 1981; Kramer, Shanks, Markely, et al., 1983; Ortiz & Gonzalez, 1989; Ortiz & Volkoff, 1987). Short forms of the WISC-R and the WISC-III for particular use with learning-disabled students are also available (Dumont & Faro, 1993).

Use with minority children. We discuss this issue more fully in Chapter 11, but mention should be made that a number of researchers have investigated the validity of the WISC-R with minority children. The results generally support the validity of the WISC-R, but they have also found some degree of cultural bias (Mishra, 1983). Studies of the WISC-R with Mexican-American children yield basically the same results as with Anglo children with regard to reliability, predictive validity, and factor structure (Dean, 1977; 1979; 1980; Johnson & McGowan, 1984). For a Mexican version of the WISC-R, see Mercer, Gomez-Palacio, and Padilla (1986); there is also a published Mexican form of the WISC-R (Wechsler, 1984) whose construct validity seems to parallel that of the American version (Fletcher, 1989). Studies of the WISC-R with black American children also indicate that the test is working as intended, with results similar to those found with white children (Gutkin & Reynolds, 1981).

The WISC-III. The WISC-R was revised in 1991 and became the WISC-III; many of the earlier items were revised, either in actual content or in form (as, for example, enlarged printing). Although the word "revision" might convey the image of a single person making some minor changes in wording to a manuscript, revision as applied to a commercially produced test such as the WISC-III is a massive undertaking. Experts are consulted, and the experiences of users in the field are collated and analyzed. Banks of items are submitted to pilot studies and statistically analyzed to identify and minimize any potential sources of bias, especially gender and race. Details such as the layout of answer sheets to accommodate right- and left-handed persons equally, and the use of color artwork that does not penalize color-blind subjects, are attended to.
Considerable reliability and validity data are presented in the test manual and in the research literature (e.g., Canivez & Watkins, 1998), with results very similar to those obtained with the WISC-R and presented above. Factor analyses of the WISC-III yield two factors that seem to correspond to the Verbal and the Performance scales (Wechsler, 1991).

The WPPSI

The Wechsler Preschool and Primary Scale of Intelligence (WPPSI) was published in 1967 (Wechsler, 1967) and covers ages 4 to 6½ years. It pretty much parallels the WAIS and the WISC in terms of subtests, assessment of reliability, and test format. In fact, 8 of the 11 subtests are revisions or downward extensions of WISC subtests. The WPPSI does contain three subtests that are unique to it: "Animal house," which requires the child to place colored cylinders in their appropriate holes under timed conditions; "geometric design," a perceptual-motor task requiring the copying of simple designs; and "sentences," a supplementary test that measures immediate recall and requires the child to repeat each sentence after the examiner. The WPPSI was revised in 1989 (Wechsler, 1989) to become the WPPSI-R, with age coverage from 3 to 7¼ years but similar in structure to the WPPSI (see the September 1991 issue of the Journal of Psychoeducational Assessment, a special issue devoted to the WPPSI-R).

Administration. It takes somewhere between 1 and 1½ hours to administer the WPPSI, and the manual recommends that this be done in one testing session.

Scoring. As with the other Wechsler tests, raw scores on each subtest are changed to normalized standard scores that have a mean of 10 and an SD of 3. The subtests are also grouped into a verbal and a performance area, and these yield a Verbal Scale IQ, a Performance Scale IQ, and a Full Scale IQ; these are deviation IQs with a mean of 100 and SD of 15. The raw-score conversions are done by using tables that are age appropriate. This means that, in effect, older children must earn higher raw scores than younger children to obtain the equivalent standard score.

Reliability. The reliability of the WPPSI is comparable with that of the other Wechsler tests. The corrected odd-even reliabilities of the WPPSI subtests are mostly in the .80s. For the Verbal and Performance scales, reliability is in the high .80s, with the Verbal scale slightly more reliable than the Performance scale; for the Full Scale IQ, reliability is in the low .90s. Similar levels of reliability have been reported in the literature for children representing diverse ethnic backgrounds and intellectual achievement (Henderson & Rankin, 1973; Richards, 1970; Ruschival & Way, 1971).

Validity. The results of validity studies of the WPPSI have produced a wide range of findings (Sattler, 1982; 1988). Scores on the WPPSI have been correlated with a variety of scores on other tests, with typical correlation coefficients between the Full Scale IQ and other test measures in the .50 to .70 range (e.g., Baum & Kelly, 1979; Gerken, 1978; B. L. Phillips, Pasewark, & Tindall, 1978). Keep in mind that for many of these samples the children tested were homogeneous – for example, retarded – and, as you recall, homogeneity limits the size of the correlation.

As might be expected, scores on the WPPSI correlate substantially with scores on the WISC-R, on the order of .80 (Wechsler, 1974), and with the Stanford-Binet. Sattler (1974) reviewed a number of such studies and reported that the median correlations between the WPPSI Verbal, Performance, and Full Scale IQs and the Stanford-Binet IQ were .81, .67, and .82, respectively. Despite the fact that the tests correlate substantially, it should be noted that the IQs obtained from the WPPSI and from the Stanford-Binet are not interchangeable. For example, Sewell (1977) found that the mean WPPSI IQ was higher than that of the Stanford-Binet, while earlier studies found just the opposite (Sattler, 1974).

Fewer studies are available on the predictive validity of the WPPSI; these typically attempt to predict subsequent academic achievement, especially in the first grade, or later IQ. In the first instance, typical correlations between WPPSI scores and subsequent achievement are in the .40 to .60 range (e.g., Crockett, Rardin, & Pasewark, 1975); in the second, they are higher, typically in the .60 to .70 range (e.g., Bishop & Butterworth, 1979). A number of studies have looked at the ability of WPPSI scores to predict reading achievement in the first grade. Typical findings are that with middle-class children there is such a relationship, with modal correlation coefficients in the .50s. With minority and disadvantaged children no such relationship is found – but again, one must keep in mind the restriction of range both on the WPPSI scores and on the criterion of reading achievement (e.g., Crockett, Rardin, & Pasewark, 1976; Serwer, B. J. Shapiro, & P. P. Shapiro, 1972; D. R. White & Jacobs, 1979).
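Restriction of range is easy to demonstrate by simulation. The data below are artificial; only the shrinkage of the correlation in the homogeneous subgroup is the point.

    import random

    random.seed(1)
    iq = [random.gauss(100, 15) for _ in range(2000)]
    # reading achievement related to IQ, plus noise
    reading = [0.6 * (x - 100) + random.gauss(0, 12) for x in iq]

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5

    pairs = list(zip(iq, reading))
    narrow = [(x, y) for x, y in pairs if 90 <= x <= 110]  # homogeneous

    print(round(pearson(iq, reading), 2))    # full range: about .6
    print(round(pearson(*zip(*narrow)), 2))  # restricted: noticeably lower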
Because the term construct validity is an umbrella term subsuming other types of validity, all of the studies mentioned so far can be considered supportive of the construct validity of the WPPSI. A number of other findings might be mentioned here. R. S. Wilson (1975) studied monozygotic (identical) twins and dizygotic (fraternal) twins and found that monozygotic twins were closer in intelligence to each other than were dizygotic twins – a finding that is in line with the view that intelligence has a substantial hereditary/genetic component, and one that supports the construct validity of the WPPSI. Other studies have focused on the relationship of IQ to socioeconomic status (e.g., A. S. Kaufman, 1973) and on the language spoken at home (e.g., Gerken, 1978).

Norms. The WPPSI was standardized on a national sample of 1,200 children, with 200 children (100 boys and 100 girls) at each of six half-year age levels from age 4 to 6. The sample was stratified using census data.

Factor structure. As with the other Wechsler tests, the subtests of the WPPSI and the Verbal and Performance scales intercorrelate significantly. Subtests typically correlate .40 to .60, and Verbal and Performance IQs correlate in the mid .60s. The results of factor-analytic studies suggest a general factor as well as two broad factors, a verbal factor and a performance factor, although the verbal component is a much more important one and may be interpreted as a general factor (Coates & Bromberg, 1973; Heil, Barclay, & Endres, 1978; Hollenbeck & Kaufman, 1973; Ramanaiah & Adams, 1979). In younger children the two broad factors are less distinct from each other, a finding in line with developmental theories that hypothesize that intellectual functioning evolves, as the child grows older, into more specialized and distinct categories. Similar results have been obtained with black children (Kaufman & Hollenbeck, 1974).

Abbreviated scales. A. S. Kaufman (1972) developed a short form of the WPPSI composed of four subtests: Arithmetic and Comprehension from the Verbal Scale, and Block Design and Picture Completion from the Performance Scale. The reliability of this short form was in the low .90s, and scores correlated with the Full Scale IQ of the entire WPPSI in the .89 to .92 range. Other investigators have also developed short forms for both the WPPSI and the WPPSI-R (e.g., Tsushima, 1994).

The WPPSI-R. As with other major tests, while the author may be a single individual, the actual revision is typically a team effort that involves a great many people; and so it is with the WPPSI-R. The WPPSI-R consists of 12 subtests designed as downward extensions of the WISC-R. Typically, five Verbal and five Performance subtests are administered, with Animal Pegs (formerly Animal House) and Sentences (similar to the Digit Span subtest of the WISC-R) as optional subtests. The Object Assembly subtest is administered first; this is a puzzle-like activity that preschoolers usually enjoy, and it is thus helpful in establishing rapport. Testing time is about 75 minutes, which may be too long for the typical young child.

The primary purpose of the WPPSI-R is to diagnose "exceptionality," particularly mental retardation and giftedness, in school settings. In addition to the extended age range, the WPPSI-R differs from the WPPSI in several ways. Approximately 50% of the items are new. Several of the subtests have more rigorous scoring rules designed to reduce examiner error and hence increase the reliability of the subtests. The WPPSI-R also includes an Object Assembly subtest, patterned after the same-named subtest on the WISC-R and the WAIS-R.

The normative sample for the WPPSI-R consisted of 1,700 children aged 3 years through 7 years, 3 months, with equal numbers of boys and girls. The sample was stratified according to U.S. Census data on such variables as race, geographical residence, parental education, and occupation.
Considerable reliability evidence is available. For example, test-retest reliability for a sample of 175 children retested after an average interval of 4 weeks ranged from .59 to .82 for the subtests, and was .88 for the Performance IQ, .90 for the Verbal IQ, and .91 for the Full Scale IQ. Split-half reliabilities for the subtests range from .63 to .86 (with a median r of .83), and are .92 for the Performance IQ, .95 for the Verbal IQ, and .96 for the Full Scale IQ. For four of the subtests, examiners need to make subjective scoring decisions (a response can be given 0, 1, or 2 points), so for these subtests interscorer reliability becomes a concern. For a sample of 151 children, two groups of scorers independently scored the protocols; the obtained reliability coefficients for the four subtests were all in the mid .90s.

As with all tests, there are criticisms. The WPPSI-R (as well as other tests like the K-ABC and the DAS) uses teaching or demonstration items, which are generally regarded as a strength in preschool measures because they ensure that the child understands what is being asked. The impact of such items on test validity has, however, been questioned (Glutting & McDermott, 1989). Perhaps the major criticism that has been voiced is that the WPPSI-R continues assessment in a historical approach that may well be outdated and does not incorporate findings from experimental studies of cognitive processing. There is no denying that the test works, but it does not advance our basic understanding of what intelligence is all about (Buckhalt, 1991).

Extensive validity data are also available, including concurrent correlations with other cognitive measures, factor analyses, and studies of group differentiation for the gifted, mentally retarded, and learning disabled. Most of the evidence is in line with that obtained with the WPPSI.

OTHER TESTS

The British Ability Scales (BAS)

Both the Stanford-Binet and the Wechsler tests have become very popular, not just in the United States but in other countries as well, including Britain. However, from the British perspective these were "foreign imports," and in 1965 the British Psychological Society set up a research project to replace the Stanford-Binet and the WISC and to develop a measure standardized on a British sample, one that would provide a profile of special abilities rather than an overall IQ. The result was the British Ability Scales (BAS), which, despite receiving highly laudatory reviews in the MMY (Embretson, 1985a; Wright & Stone, 1985), was virtually unknown in the United States until it was "retranslated" and restandardized with an American sample and called the Differential Ability Scales (DAS) (Elliott, 1990a).

Description. The BAS is an individual intelligence test designed for ages 2½ to 17½; it contains 23 scales that cover 6 areas and yield 3 IQ scores. The six areas are: (1) speed of information processing, (2) reasoning, (3) spatial imagery, (4) perceptual matching, (5) short-term memory, and (6) retrieval and application of knowledge. The three IQ scores are General, Visual, and Verbal.

Each of the six areas is composed of a number of subscales; for example, the Reasoning area is made up of four subscales, while the Retrieval and Application of Knowledge area is made up of seven subscales. All of the subscales are appropriate for multiple age levels. For example, the Block Design subscale is appropriate for ages 4 to 17, while the Visual Recognition subscale is appropriate for ages 2½ to 8. Thus, which subscales are used depends on the age of the child being tested.

The BAS is unusual in at least two respects: it was developed using very sophisticated psychometric strategies, and it incorporates various theories in its subscales. Specifically, the BAS subscales were developed according to the Rasch latent-trait model, a sophisticated psychometric theory whose procedures are beyond the scope of this book (Rasch, 1966). Two of the subscales on the BAS are based on the developmental theories of Piaget, and one subscale, that of Social Reasoning, is based on Kohlberg's (1979) theory of moral reasoning. Finally, two subtests, Word Reading and Basic Arithmetic, both for ages 5 to 14 and both from the Retrieval and Application of Knowledge area, can be used to estimate school achievement.

Administration and scoring. Administration and scoring procedures are well designed and clearly specified, and they hold the examiner's potential bias to a minimum. The raw scores are changed to T scores and to percentiles and are compared with appropriate age norms.
file of special abilities rather than an overall IQ. scores and to percentiles and are compared with
The result was the British Ability Scales (BAS), appropriate age norms.
Reliability and validity. Unfortunately, little data is given in the test manual about reliability and about validity. Embretson (1985a) reports that the results of factor analyses are available, as well as the results of five concurrent validity studies, all with positive results, but the details are not given. Similarly, Wright and Stone (1985) indicate that there is "ample evidence" for the internal consistency and construct validity of the scales, but no details are provided. A few studies are available in the literature, but their volume in no way approaches the voluminous literature available on the Stanford-Binet and on the Wechsler tests. For an application of the BAS to learning-disabled children see Elliott and Tyler (1987) and Tyler and Elliott (1988). Buckhalt studied the BAS with black and white children (Buckhalt, 1990) and students from the United States (Buckhalt, Denes, & Stratton, 1989).

Norms. The norms were carefully constructed to create a representative sample for Britain. There are 113 school districts in Britain, and 75 of these participated in the norming effort, which yielded a sample of 3,435 children.

Interesting aspects. Because of the Rasch psychometric approach used in the development and standardization of the items, the BAS can be seen as an "item bank" where the individual examiner can, in effect, add or delete specific items to form their own subtests, without losing the benefits of standardization (Wright & Stone, 1985). Another aspect that follows from the Rasch model is that it is possible to compare subscale differences for a specific child through procedures indicated in the manual.

Criticisms. As Embretson (1985a) stated, the BAS possesses excellent potential and great psychometric sophistication, but the 1985 data on reliability and validity was judged inadequate. For a review of the BAS see Buckhalt (1986).

The Differential Ability Scales. The BAS was introduced in the United States as the DAS and seems well on its way toward becoming a popular test. The DAS is very similar to the BAS; some BAS subtests were eliminated or modified, so that the DAS consists of 20 subtests, 17 of which are "cognitive" and 3 of which are achievement subtests. The age range goes from 2½ to 17 years, 11 months. One of the major objectives in the development of the DAS was to produce subtests that were homogeneous and hence highly reliable, so that an examiner could identify the cognitive strengths and weaknesses of an examinee. Administration of the DAS requires entry into each subtest at a level appropriate for the age of the subject. Cues on the record form indicate age-related entry levels and decision points for either continuing or retreating to an earlier age level.

Twelve of the DAS cognitive subtests are identified as core subtests because they have high loadings on g. Groupings of two or three of these subtests result in subfactors called cluster scores; these are Verbal and Nonverbal at the upper preschool level, and Verbal, Nonverbal Reasoning, and Spatial ability at the school-age level. An additional five cognitive subtests are labeled diagnostic subtests; these have low g loadings, but presumably are useful in assessment. These subtests measure short-term memory, perceptual skills, and speed of information processing.

The DAS yields five types of scores: (1) subtest raw scores, which are converted to (2) ability scores, using appropriate tables. These are not normative scores, but provide a scale for judging performance within a subtest. These ability scores are then converted to (3) T scores for normative comparisons. The T scores can be summed to obtain (4) cluster scores, which in turn yield (5) the General Conceptual Ability score. T scores, cluster scores, and the GCA score can be converted to percentiles, standard scores, or age-equivalent scores with use of the appropriate tables.
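To make the five-step conversion concrete, here is a minimal sketch in Python. The tiny lookup tables and function names below are hypothetical stand-ins for the DAS norm tables, meant only to mirror the raw-to-ability-to-T-to-cluster-to-GCA sequence just described, not to reproduce the actual manual.

    # Hypothetical miniature norm tables; the real DAS manual supplies
    # one table per subtest and age group.
    RAW_TO_ABILITY = {0: 80, 1: 95, 2: 110, 3: 120}    # within-subtest scale
    ABILITY_TO_T = {80: 35, 95: 45, 110: 55, 120: 62}  # normative T scores

    def subtest_t(raw_score):
        """Raw score -> ability score -> T score (two table lookups)."""
        return ABILITY_TO_T[RAW_TO_ABILITY[raw_score]]

    # Cluster scores are built from the sum of core-subtest T scores; the
    # GCA would then be derived from cluster scores by a further lookup.
    verbal_cluster = subtest_t(2) + subtest_t(3)
    print(verbal_cluster)  # 117 on this toy scale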
The subtests cover a range of abilities including both verbal and nonverbal reasoning, visual and auditory memory, language comprehension, speed of information processing, and school achievement in basic number skills, spelling, and word reading. The battery does not yield a global composite score derived from all subtests, as one would find on the WISC-R, for example. There is, however, a General Conceptual Ability (GCA) score, a measure of g, based on four to six subtests, depending on the child's age. T. Z. Keith (1990) concluded that the DAS is a robust measure of g; that for preschool children the DAS measures Verbal and Nonverbal abilities in addition to g; and that for school-aged children the DAS
measures verbal ability and spatial-reasoning skill.

Elliott (1990a) not only indicates that the DAS has a broad theoretical basis and could be interpreted from a variety of theories, but also that the term General Conceptual Ability is a better term than IQ or intelligence. The DAS is said to be a "purer" and more homogeneous measure than the global scores used by the Stanford-Binet or the Wechsler scales, primarily because the GCA is composed only of those subtests that had high loadings on g, whereas the other tests include in their composite scores subtests with low g loadings.

One concept particularly relevant to the DAS, but also applicable to any test that yields a profile of subtest scores, is the concept of specificity (note that this is a different use of the word from our earlier discussion). Specificity can be defined as the unique assessment contribution of a subtest. If we have a test made up of three subtests A, B, and C, we would want each of the subtests to measure something unique, rather than to have three subtests that are essentially alternate forms of the same thing. Specificity can also be defined psychometrically as the proportion of score variance that is reliable and unique to the subtest. Specificity can be computed by subtracting the squared multiple correlation of each subtest with all other subtests from the reliability of the subtest. For example, if subtest A has a reliability of .90 and correlates .40 with both subtests B and C, then its specificity will be .90 − (.40)² = .74. For the DAS, the average specificity for its diagnostic subtests is about .73, while for the WISC-R it is about .30 (Elliott, 1990b).
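A minimal sketch of this computation in Python, using the simplified book example above (in practice, the squared multiple correlation would come from regressing each subtest on all of the others):

    def specificity(reliability, r_squared):
        """Proportion of score variance that is reliable and unique:
        subtest reliability minus its squared multiple correlation."""
        return reliability - r_squared

    # Subtest A: reliability .90; its (simplified) multiple correlation
    # with subtests B and C is taken to be .40.
    print(specificity(0.90, 0.40 ** 2))  # 0.90 - 0.16 = 0.74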
The DAS was standardized on 3,475 U.S. children selected on the basis of U.S. Census data. An attempt was made to include special-education children such as the learning disabled and speech impaired, but severely handicapped children were not included. Gifted and talented children are slightly overrepresented.

Reliability is fairly comparable with that of the Wechsler tests. For example, mean internal reliability coefficients range from .70 to .92 for various subtests; for the GCA they range from .90 to .95. Test-retest reliability, based on 2 to 6 weeks, yielded a GCA coefficient of .90, and interrater reliability for the four subtests that require subjective judgment is in the .90 range.

The DAS manual (Elliott, 1990a) reports several validity studies, including correlations with the WISC-R (mid .80s) and with the Stanford-Binet IV (high .70s to high .80s). The literature also contains a number of studies supportive of the validity of the DAS (e.g., McIntosh, 1999). Recent reviews of the DAS are quite favorable and point to its technical excellence and potential use with minority children (Braden, 1992).

The Kaufman Assessment Battery for Children (K-ABC)

Kaufman (1983) describes intelligence as the ability to process information effectively to solve unfamiliar problems. In addition, he distinguished between sequential and simultaneous processing. A number of other theorists have analyzed intellectual functioning into two modes of mental organization. Freud, for example, spoke of primary and secondary processes. Recently, Guilford (1967b) focused on convergent and divergent thinking, while R. B. Cattell (1963) used the terms fluid and crystallized intelligence, and Wechsler (1958) used verbal and nonverbal intelligence. One dichotomy that has found its way into a number of tests and test interpretations is the notion of sequential (or successive) and simultaneous processing (Luria, 1966). Sequential processing requires the organization of stimuli into some temporally organized series, where the specific order of the stimuli is more important than the overall relationship of these stimuli. For example, as you read these words it is their sequencing that is important for the words to have meaning. Sequential processing is typically based on verbal processes and depends on language for thinking and remembering; it is serial in its nature. Simultaneous processing involves stimuli that are primarily spatial and focuses on the relationship between elements. To understand the sentence "this box is longer than this pencil," we must not only have an understanding of the sequence of the words, we must also understand the comparative spatial relationship of "longer than." Simultaneous processing searches for patterns and configurations; it is holistic. A. S. Kaufman (1979b) suggested that the WISC-R subtests could be organized along the lines of sequential vs. simultaneous processing. For example, Coding and Arithmetic require
sequential processing, while Picture Completion and Block Design require simultaneous processing. He then developed the K-ABC to specifically assess these dimensions.

Development. As with other major tests of intelligence, the development of the K-ABC used a wide variety of pilot studies and evaluative procedures. Over 4,000 protocols were administered as part of this development. As with other major tests of intelligence, short forms of the K-ABC have been developed for possible use when a general estimate of mental functioning is needed in a short time period (A. S. Kaufman & Applegate, 1988).

Description. The K-ABC is an individually administered intelligence and achievement measure that assesses styles of problem solving and information processing in children ages 2½ to 12½. It is composed of five global scales: (1) Sequential processing scale; (2) Simultaneous processing scale; (3) Mental processing composite scale, which is a combination of the first two; (4) Achievement scale; and (5) Nonverbal scale. The actual battery consists of 16 subtests, including 10 that assess a child's sequential and simultaneous processing and 6 that evaluate a child's achievement in academic areas such as reading and arithmetic; because not all subtests cover all ages, any individual child would at the most be administered 13 subtests. The 10 subtests that assess the child's processing include practice items so that the examiner can communicate to the child the nature of the task and can observe whether the child understands what to do. Of these 10 subtests, 7 are designed to assess simultaneous processing, and 3 sequential processing; the 3 sequential processing subtests all involve short-term memory.

All the items in the first three global scales minimize the role of language and acquired facts and skills. The Achievement scale assesses what a child has learned in school; this scale uses items that are more traditionally found on tests of verbal intelligence and tests of school achievement. The Nonverbal scale is an abbreviated version of the Mental Processing composite scale, and is intended to assess the intelligence of children with speech or language disorders, with hearing impairments, or those who do not speak English; all tasks on this scale may be administered in pantomime and are responded to with motor rather than verbal behavior, for example by pointing to the correct response.

The K-ABC is a multisubtest battery, so that its format is quite suitable for profile analysis. In fact, A. S. Kaufman and N. L. Kaufman (1983) provide in the test manual lists of abilities associated with specific combinations of subtests. For example, attention to visual detail can be assessed by a combination of three subtests: Gestalt Closure, Matrix Analogies, and Photo Series.

Administration. Like the Stanford-Binet and the Wechsler tests, the K-ABC requires a trained examiner. Administration time varies from about 40 to 45 minutes for younger children to 75 to 85 minutes for older children.

Scoring. All K-ABC scales yield standard scores with a mean of 100 and SD of 15; the subtests yield scores with a mean of 10 and SD of 3. This was purposely done to permit direct comparison of scores with other tests such as the WISC.
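Because both metrics are standard scores, a score on one can be re-expressed on the other through a z score. A brief sketch (the score of 13 is hypothetical, not a value from the K-ABC manual):

    def to_z(score, mean, sd):
        """Convert a standard score to a z score."""
        return (score - mean) / sd

    def from_z(z, mean, sd):
        """Express a z score on another standard-score metric."""
        return mean + z * sd

    # A subtest score of 13 (mean 10, SD 3) is one SD above the mean,
    # which corresponds to 115 on a scale with mean 100 and SD 15.
    z = to_z(13, mean=10, sd=3)
    print(from_z(z, mean=100, sd=15))  # 115.0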
Reliability. Split-half reliability coefficients range from .86 to .93 for preschool children, and from .89 to .97 for school-age children, for the various global scales. Test-retest coefficients, based on 246 children retested after 2 to 4 weeks, yielded stability coefficients in the .80s and low .90s, with stability increasing with increasing age. As mentioned earlier, specific abilities are said to be assessed by specific combinations of subtests, so a basic question concerns the reliability of such composites; the literature suggests that they are quite reliable, with typical coefficients in the mid .80 to mid .90 range (e.g., Siegel & Piotrowski, 1985).

Validity. The K-ABC Interpretive Manual (A. S. Kaufman & N. L. Kaufman, 1983) presents the results of more than 40 validity studies, and gives substantial support to the construct, concurrent, and predictive validity of the battery. These studies were conducted on normal samples as well as on special populations such as learning disabled, hearing impaired, educable and trainable mentally retarded, physically handicapped, and gifted.
For normal samples, correlations between the K-ABC and the Stanford-Binet range from .61 to .86, while with the WISC-R they center around .80 (A. S. Kaufman & N. L. Kaufman, 1983). For example, the Mental Processing Composite score correlates about .70 with the Full Scale IQ of the WISC-R. You recall that by squaring this coefficient, we obtain an estimate of the "overlap" of the two scales; thus these scales overlap about 50%, indicating a substantial overlap, but also some uniqueness to each measure.
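As a quick check of that arithmetic (a sketch using the .70 correlation reported above):

    r = 0.70                 # correlation between the two composites
    overlap = r ** 2         # proportion of shared variance
    print(f"{overlap:.0%}")  # 49% -- roughly the 50% noted above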
The K-ABC Achievement scale correlates in the .70s and .80s with overall levels of achievement as measured by various achievement batteries. Other sources also support the validity of the K-ABC (e.g., Reynolds & Kamphaus, 1997).

Norms. The K-ABC was standardized on a nationwide stratified sample of 2,000 children, from age 2½ to 12 years, 5 months, 100 at each 6-month interval. A special effort was made to include children from various ethnic backgrounds and children in special education programs, including children with mental disabilities and gifted children. Special norms are provided for black and white children, separately, from different socioeconomic backgrounds as defined by parents' education.

Interesting aspects. One of the interesting aspects of the K-ABC is that race differences on the battery, while they exist, are substantially smaller in magnitude than those found on the WISC-R. Typical differences between black and white children are about 7 points on the K-ABC but about 16 points on the WISC-R.

Several investigators (e.g., Barry, Klanderman, & Stipe, 1983; McCallum, Karnes, & Edwards, 1984; Meador, Livesay, & Finn, 1983) have compared the K-ABC, the WISC-R, and the Stanford-Binet in gifted children and have found that the K-ABC yields lower mean scores than the other two tests – results that are also found in children who are not gifted. The indications are that the K-ABC minimizes expressive and verbal reasoning skills, as it was intended to. One practical implication of this is that a school that uses one of these tests as part of a decision-making assessment for identifying and placing gifted children will identify fewer such children if it uses the K-ABC, but may well identify a greater proportion of minority gifted children. For a thorough review of the K-ABC see Kamphaus and Reynolds (1987). It would seem that the K-ABC would be particularly useful in the study of children with such problems as attention-deficit disorder, but the restricted available data are not fully supportive (e.g., Carter, Zelko, Oas, et al., 1990).

Criticisms. The K-ABC has been criticized on a number of issues, including its validity for minority groups (e.g., Sternberg, 1984), its appropriateness for preschool children (e.g., Bracken, 1985), its theoretical basis (e.g., Jensen, 1984), and its lack of instructional utility (Good, Vollmer, Creek, et al., 1993).

Aptitude by treatment interaction. One of the primary purposes of the K-ABC was not only to make a classification decision (e.g., this child has an IQ of 125 and therefore should be placed in an accelerated class) but also to be used in a diagnostic or prescriptive manner to improve student academic outcomes. Specifically, if a child is more efficient in learning by sequential processing than by simultaneous processing, then that child ought to be instructed by sequential-processing procedures. Generically, this instructional model is called aptitude by treatment interaction.

The Structure of Intellect Learning Abilities Test (SOI-LA)

The SOI-LA (M. Meeker, R. J. Meeker, & Roid, 1985) is a series of tests designed to assess up to 26 cognitive factors of intelligence in both children and adults. Its aim is to provide a profile of a person's cognitive strengths and weaknesses. The SOI-LA is based on Guilford's structure of intellect model that postulates 120 abilities, reduced to 26 for this series. These 26 subtests yield a total of 14 general ability scores. The 26 dimensions do cover the five operations described by Guilford – namely, cognition, memory, evaluation, convergent production, and divergent production.

Description. There are seven forms of the SOI-LA available. Form A is the principal form, and Form B is an alternate form. Form G is a gifted screening form. Form M is for students having difficulties with math concepts and is composed
of 12 subtests from form A that are related to arithmetic, mathematics, and science. Form R is composed of 12 subtests from form A that are related to reading, language arts, and social science. Form P is designed for kindergarten through the third grade. Finally, form RR is a reading readiness form designed for young children and "new readers."

Administration. The SOI-LA may be administered individually or in a group format. Forms A and B each require 2½ to 3 hours to administer, even though most of the subtests are 3 to 5 minutes in length. The test manual recommends that two separate testing sessions be held. There are clear instructions in the manual for the administration of the various subtests, as well as directions for making sure that students understand what is requested of them. In general, the SOI-LA is relatively easy to administer and to score, and can easily be done by a classroom teacher or aide.

Scoring. The directions for scoring the subtests are given in the manual and are quite clear and detailed. Most of the subtests can be scored objectively; that is, there is a correct answer for each item. Two of the subtests, however, require subjective scoring. In one subtest, Divergent Production of Figural Units (DPFU), the child is required to complete each of 16 squares into something different. In the second subtest, the Divergent Production of Semantic Units (DPSU), the child is asked to write a story about a drawing from the previous subtest.

Reliability. Test-retest reliability, with a 2- to 4-week interval, ranges from .35 to .88 for the 26 subtests, with a median coefficient of .57; only 4 of the 26 coefficients are equal to or exceed .75 (J. A. Cummings, 1989). From a stability-over-time perspective, the SOI-LA leaves much to be desired. Because the SOI-LA subtests are heavily speeded, internal consistency reliability is not appropriate, and of course, the manual does not report any. Internal consistency reliability is based on the consistency of errors made in each subpart of a test, but in a speeded test the consistency is of the rapidity with which one works.

Because there are two equivalent forms, alternate-form reliability is appropriate. Unfortunately, only 3 of the 26 subtests achieve adequate alternate-form reliability (J. A. Cummings, 1989). These three subtests, incidentally, are among those that have the highest test-retest reliabilities.

For the two subtests that require subjective scoring, interscorer reliability becomes important. Such interscorer reliability coefficients range from .75 to .85 for the DPFU subtest, and from .92 to 1.00 for the DPSU subtest (M. Meeker, R. J. Meeker, & Roid, 1985). These are rather high coefficients, not usually found this high in tests where subjective scoring is a major aspect.

Standardization. The normative sample consisted of 349 to 474 school children in each of five grade levels, from grades 2 to 6, with roughly equivalent representation of boys and girls. Approximately half of the children came from California, and the other half from school districts in three states. For the intermediate levels, samples of children in grades 7 to 12 were assessed, while adult norms are based on various groups aged 18 to 55. Little information is given on these various samples.

Diagnostic and prescriptive aspects. The subtests are timed so that the raw scores can be compared with norms in a meaningful way. However, subjects may be given additional time for uncompleted items, although they need to indicate where they stopped when time was called so the raw score can be calculated. Thus, two sets of scores can be computed: one set is obtained under standard administrative procedures and is therefore comparable to norms, and another set reflects ability with no time limit and is potentially useful for diagnostic purposes or for planning remedial action. Whether this procedure is indeed valid remains to be proven, but the distinction between "actual performance" and "potential" is an intriguing one, used by a number of psychologists.

There is available a teacher's guide that goes along with the SOI-LA, whose instructional focus is on the remediation of deficits as identified by the test (M. Meeker, 1985). This represents a somewhat novel and potentially useful approach, although evidence needs to be generated that such remedial changes are possible.
Criticisms. The SOI-LA represents an interesting approach based on a specific theory of intellectual functioning. One of the major criticisms of this test, expressed quite strongly in the MMY reviews (Coffman, 1985; Cummings, 1989), is that the low reliabilities yield large standard errors of measurement, which means that, before we can conclude that Nadia performed better on one subtest than on another, the two scores need to differ by a substantial amount. Because the SOI-LA is geared at providing a profile that is based on subtest differences, this is a rather serious criticism and major limitation. Other criticisms include the lack of representativeness of the standardization sample, not to mention the dearth of empirical validity data.
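The connection between low reliability and a large standard error of measurement can be made explicit with the standard formula SEM = SD * sqrt(1 − r). A sketch with hypothetical values, using the SOI-LA's median test-retest coefficient of .57:

    import math

    def sem(sd, reliability):
        """Standard error of measurement: SD * sqrt(1 - r)."""
        return sd * math.sqrt(1.0 - reliability)

    # On a hypothetical subtest scaled to SD = 10:
    print(round(sem(10, 0.57), 1))  # 6.6 -- a wide band around any score
    print(round(sem(10, 0.90), 1))  # 3.2 -- far narrower when r = .90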
The School and College Ability Tests, III (SCAT III)

A number of tests developed for group administration, typically in school settings, are designed to assess intellectual competence, broadly defined. The SCAT III is a typical example of such a test. The SCAT III is designed to measure academic aptitude by assessing basic verbal and quantitative abilities of students in grades 3 through 12; an earlier version, the SCAT II, went up to grades 13 and 14. There are two forms of the SCAT III, each with three levels, for use in elementary grades (grades 3.5 to 6.5), intermediate grades (6.5 to 9.5), and advanced grades (9.5 to 12.5). Unlike achievement tests that measure the effect of a specified set of instructions, such as elementary French or introductory algebra, the SCAT III is designed to assess the accumulation of learning throughout the person's life.

The SCAT III was standardized and normed in 1977–1978 and was published in 1979. Its predecessor, the SCAT II, was originally developed in 1957, was normed and standardized in 1966, and was renormed in 1970.

Description. Each level of the SCAT III contains 100 multiple-choice test items, 50 verbal in content and 50 quantitative. The verbal items consist of analogies given in a multiple-choice format. For example:

arm is to hand as:
(a) head is to shoulder
(b) foot is to leg
(c) ear is to mouth
(d) hand is to finger

Quantitative items are given as two quantitative expressions, and the student needs to decide whether the two expressions are equal, whether one is greater, or whether insufficient information is given. Thus, two circles of differing size might be given, and the student needs to determine whether the radius of one is larger than that of the other.

Administration. The SCAT III is a group-administered test and thus requires no special clinical skills or training. Clear instructions are given in the test manual, and the examiner needs to follow these.

Scoring. When the SCAT III is administered, it is often administered to large groups, perhaps an entire school or school system. Thus, provisions are made by the test publisher for having the answer sheets scored by machine, and the results reported back to the school. These results can be reported in a wide variety of ways including SCAT raw scores, standard scores, percentile ranks, or stanines. For each examinee, the SCAT III yields 3 scores: a Verbal score, a Quantitative score, and a Total score.

Validity. When the SCAT III test was standardized, it was standardized concurrently with an achievement test battery known as the Sequential Tests of Educational Progress, or STEP. SCAT III scores are good predictors of STEP scores; that is, we have an aptitude test (the SCAT III) that predicts quite well how a student will do in school subjects, as assessed by an achievement test, the STEP. From the viewpoint of school personnel, this is an attractive feature, in that the SCAT and the STEP provide a complete test package from one publisher. Yet one can ask whether in fact two tests are needed – perhaps we need only be concerned with actual achievement, not also with potential aptitude. We can also wonder why it might be important to predict scores on achievement tests; might a more meaningful target of prediction be actual classroom achievement?
There are a number of studies that address the validity of the SCAT III and its earlier forms (e.g., Ong & Marchbanks, 1973), but a surprisingly large number of these seem to be unpublished master's theses and doctoral dissertations, not readily available to the average user. It is also said that SCAT scores in grades 9 through 12 can be used to estimate future performance on the Scholastic Aptitude Test (SAT), which is not surprising because both are aptitude tests heavily focusing on school-related abilities. The SCAT III has been criticized for the lack of information about its validity (Passow, 1985).

Norms. Norms were developed using four variables: geographical region, urban versus rural, ethnicity, and socioeconomic status. In addition to public schools, Catholic and independent schools were also sampled, although separate norms for these groups are not given. Separate gender norms are also not given, so it may well be that there are no significant gender differences. This illustrates a practical difficulty for the potential test user. Not only is test information quite often fragmentary and/or scattered in the literature, but one must come to conclusions that may well be erroneous.

The Otis-Lennon School Ability Test (OLSAT)

Another example of a group intelligence test, also used quite frequently in school systems, is the OLSAT (often called the Otis-Lennon). This test is a descendant of a series of intelligence tests originally developed by Arthur Otis. In the earlier forms, Otis attempted to use Binet-type items that could be administered in a group situation. There are two forms of the OLSAT, forms R and S, with five levels: primary level for grade 1; primary II level for grades 2 and 3; elementary level for grades 4 and 5; intermediate level for grades 6 to 8; and advanced level for grades 9 through 12. The OLSAT is based on a hierarchical theory of intelligence, which views intelligence as composed, at one level, of two major domains: verbal-educational and practical-mechanical group factors. The OLSAT is designed to measure only the verbal-educational domain. The test was also influenced by Guilford's structure of intellect model in that items were selected so as to reflect the intellectual operations of cognition, convergent thinking, and evaluation.

Description. The test authors began with an initial pool of 1,500 items and administered these, in subsets, to nearly 55,000 students. Incidentally, this illustrates a typical technique of test construction. When the initial pool of items is too large to administer to one group, subsets of items are constructed that can be more conveniently administered to separate groups. In the OLSAT, those items that survived item difficulty and item discrimination analyses were retained. In addition, all items were reviewed by minority educators and were analyzed statistically to assess those items that might be unfair or discriminate against minority group members; items that did not meet these criteria were eliminated.

Reliability. Internal consistency coefficients are reported for the OLSAT, with rather large samples of 6,000 to 12,000 children. The K-R coefficients range from .88 to .95, indicating that the OLSAT is a homogeneous measure and internally consistent. Test-retest correlation coefficients are also given for smaller but still sizable samples, in the 200 to 400 range, over a 6-month period. Obtained coefficients range from .84 to .92. Retest over a longer period of 3 to 4 years yielded lower correlation coefficients of .75 to .78 (Dyer, 1985). The standard error of measurement for this test is reported to be about 4 points.
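The K-R (Kuder-Richardson formula 20) coefficient is computed from the pattern of right and wrong answers. A minimal sketch, with a tiny made-up response matrix standing in for real OLSAT data:

    def kr20(responses):
        """Kuder-Richardson formula 20 for dichotomously scored items.
        responses: one list of 0/1 item scores per examinee."""
        n_items = len(responses[0])
        n_people = len(responses)
        totals = [sum(person) for person in responses]
        mean_total = sum(totals) / n_people
        var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
        sum_pq = 0.0
        for j in range(n_items):
            p = sum(person[j] for person in responses) / n_people
            sum_pq += p * (1 - p)
        return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

    # Made-up data: four examinees answering five items (1 = correct).
    data = [[1, 1, 1, 0, 1],
            [1, 0, 1, 0, 0],
            [1, 1, 1, 1, 1],
            [0, 0, 1, 0, 0]]
    print(round(kr20(data), 2))  # 0.81 for this toy matrix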
Validity. Oakland (1985a) indicates that the OLSAT appears to have suitable content validity, based on an evaluation of the test items, the test format, the directions, and other aspects. Comparisons of the OLSAT with a variety of other measures of scholastic aptitude, achievement test scores, and intelligence test scores indicate moderate to high correlations in the .60 to .80 range, with higher correlations with variables that assess verbal abilities. Construct validity is said to be largely absent (Dyer, 1985; Oakland, 1985a). In fact, while the test is praised for its psychometric sophistication and standardization rigor, it is criticized for the lack of information on validity (Dyer, 1985).

Norms. The OLSAT was standardized in the fall of 1977 through the assessment of some 130,000
pupils in 70 different school systems, including both public and private schools. The sample was stratified using census data on several variables, including geographic region of residence and socioeconomic status. The racial-ethnic distribution of the sample also closely paralleled the census data and included 74% white, 20% black, 4% Hispanic, and 2% other.

Normative data are reported by age and grade using deviation IQs, which in this test are called School Ability Indexes, as well as percentiles and stanines. The School Ability Index is normed with a mean of 100 and SD of 16.

The Slosson Intelligence Test (SIT)

There are a number of situations, both research and applied, where there is need for a "quickie" screening instrument that is easy to administer in a group setting, does not take up much time for either administration or scoring, and yields a rough estimate of a person's level of general intelligence. These situations might involve identifying subjects that meet certain specifications for a research study, or possible candidates for an enrichment program in primary grades, or potential candidates for a college fellowship. There are a number of such instruments available, many of dubious utility and validity, that nevertheless are used. The SIT is probably typical of these.

Description. The SIT is intended as a brief screening instrument to evaluate a person's intellectual ability, although it is also presented by its author as a "parallel" form for the Stanford-Binet and was in fact developed as an abbreviated version of the Stanford-Binet (Slosson, 1963). The SIT was first published in 1961 and revised in 1981, although no substantive changes seem to have been made from one version to the other. It was revised again in 1991, a revision in which items were added that were more similar to the Wechsler tests than to the Stanford-Binet. This latest version was called the SIT-R. The test contains 194 untimed items and is said to extend from age 2 years to 27 years. No theoretical rationale is presented for this test, but because it originally was based on the Stanford-Binet, presumably it is designed to assess abstract reasoning abilities, comprehension, and judgment, in a global way (Slosson, 1991).

Administration. The test can be administered by teachers and other individuals who may not have extensive training in test administration. The average test-taking time is about 10 to 15 minutes.

Scoring. Scoring is quite objective and requires little of the clinical skills needed to score a Stanford-Binet or a Wechsler test. The raw score yields a mental age that can then be used to calculate a ratio IQ using the familiar ratio of MA/CA × 100, or a deviation IQ through the use of normative tables.
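A sketch of the ratio-IQ arithmetic (the ages below are hypothetical, chosen only for illustration):

    def ratio_iq(mental_age_months, chronological_age_months):
        """Classical ratio IQ: (MA / CA) x 100."""
        return mental_age_months / chronological_age_months * 100

    # A child of 8 years 0 months (96 months) earning a mental age of
    # 10 years 0 months (120 months) obtains a ratio IQ of 125.
    print(ratio_iq(120, 96))  # 125.0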
Reliability. The test-retest reliability for a sample of 139 persons, ages 4 to 50, retested over a 2-month interval, is reported to be .97. For the SIT-R, a test-retest with a sample of 41 subjects retested after 1 week yielded a coefficient of .96.

Validity. Correlations between the SIT and the Stanford-Binet are in the mid .90s, with the WISC in the mid .70s, and with various achievement tests in the .30 to .50 range. For the SIT-R, several studies are reported in the test manual that compare SIT-R scores with Wechsler scores, with typical correlation coefficients in the low .80s. A typical study is that of Grossman and Johnson (1983), who administered the SIT and the Otis-Lennon Mental Ability Test (a precursor of the OLSAT) to a sample of 46 children who were candidates for possible inclusion in an enrichment program for the gifted. Scores on the two tests correlated .94. However, the mean IQ for the SIT was reported to be 127.17, while for the Otis-Lennon it was 112.69. Although both tests were normalized to the same scale, with a mean of 100 and SD of 16, note the substantially higher mean on the SIT. If nothing else, this indicates that whenever we see an IQ reported for a person we ought to also know which test was used to compute it – and we need to remind ourselves that the IQ is a property of the test rather than the person. In the same study, both measures correlated in the .90s with scores on selected subtests (such as Vocabulary and Reading Comprehension) of the Stanford Achievement Test, a battery that is
commonly used to assess the school achievement of children.

Norms. The norms for the SIT are based on a sample of 1,109 persons, ranging in age from 2 to 18. These were all New England residents, but information on gender, ethnicity, or other aspects is not given. Note should be made that the mean of the SIT is 97 and its SD is 20. This larger SD causes severe problems of interpretation if the SIT is in fact used to make diagnostic or placement decisions (W. M. Reynolds, 1979).

The SIT-R was standardized on a sample of 1,854 individuals, said to be somewhat representative of the U.S. population in educational and other characteristics.

Criticisms. Although the SIT is used relatively frequently in both the research literature and in applied situations, it has been severely criticized for a narrow and unrepresentative standardization sample, for lack of information on reliability and validity, for its suggested use by untrained examiners, which runs counter to APA professional ethics, and for its unwarranted claims of equivalence with the Stanford-Binet (Oakland, 1985b; W. M. Reynolds, 1985). In summary, the SIT is characterized as a psychometrically poor measure of general intelligence (W. M. Reynolds, 1985).

The Speed of Thinking Test (STT)

So far we have looked at measures that are multivariate, that assess intelligence in a very complex way, either globally or explicitly composed of various dimensions. There are, however, literally hundreds of measures that assess specific cognitive skills or dimensions. The STT is illustrative.

Carver (1992) presented the STT as a test to measure cognitive speed. The STT is designed to measure how fast individuals can choose the correct answers to simple mental problems. In this case, the problems consist of pairs of letters, one in upper case and one in lower case. The respondent needs to decide whether the two letters are the same or different – e.g., Aa vs. aB. Similar tasks have been used in the literature, both at the theoretical and at the applied level, especially in studies related to reading ability.

The STT is made up of 180 items that use all the eight possible combinations of the letters a and b, with one letter in upper case and one in lower case. There is a practice test that is administered first. Both the practice and the actual test have a 2-minute time limit each. Thus the entire procedure, including distribution of materials in a group setting and instructions, requires less than 10 minutes.
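The "eight possible combinations" can be enumerated directly. A sketch (the pairing scheme is inferred from the description above, not taken from Carver's materials):

    from itertools import product

    # One letter in upper case and one in lower case, drawn from {a, b}:
    # 2 upper-case choices x 2 lower-case choices x 2 orders = 8 pairs.
    pairs = [upper + lower if upper_first else lower + upper
             for upper, lower, upper_first in product("AB", "ab", (True, False))]
    print(pairs)  # ['Aa', 'aA', 'Ab', 'bA', 'Ba', 'aB', 'Bb', 'bB']

    def is_same(pair):
        """A pair is 'same' if both letters match, ignoring case."""
        return pair[0].lower() == pair[1].lower()

    print(is_same("Aa"), is_same("aB"))  # True False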
The STT was administered, along with other instruments, to 129 college students enrolled in college reading and study skills courses. The test-retest reliability, with a 2-week interval, was .80. Scores on the STT correlated .60 with another measure designed to assess silent reading rate, and .26 (significant but low) with a measure of reading rate, but did not correlate significantly with two measures of vocabulary level. Thus both convergent and discriminant validity seem to be supported. Obviously, much more information is needed.

SUMMARY

In this chapter, we briefly looked at various theories of cognitive assessment and a variety of issues. We only scratched the surface in terms of the variety of points of view that exist, and in terms of the controversies about the nature and nurture of intelligence. We looked, in some detail, at the various forms of the Binet tests, because in some ways, they nicely illustrated the historical progression of "classical" testing. We also looked at the Wechsler series of tests because they are quite popular and also illustrate some basic principles of testing. The other tests were chosen as illustrations, some because of their potential utility (e.g., the BAS), or because they embody an interesting theoretical perspective (e.g., the SOI-LA), or because they seem to be growing in usefulness and popularity (e.g., the K-ABC). Some, such as the SIT, leave much to be desired, and others, such as the Otis-Lennon, seem to be less used than in the past. Tests, like other market products, achieve varying degrees of popularity and commercial success, but hopefully the lessons they teach us will outlast their use.
SUGGESTED READINGS

Byrd, P. D., & Buckhalt, J. A. (1991). A multitrait-multimethod construct validity study of the Differential Ability Scales. Journal of Psychoeducational Assessment, 9, 121–129.

You will recall that we discussed the multitrait-multimethod design as a way of assessing construct validity, and more specifically, as a way of obtaining convergent and discriminant validity information. This study of 46 rural Alabama children analyzes scores from the DAS, the WISC-R, and the Stanford Achievement Test. The authors conclude that one must be careful in comparing subtests from the DAS and the WISC-R, even though they may have similar content. One may well ask whether this article represents an accurate utilization of the multitrait-multimethod approach – are the methods assessed really different?

Frederiksen, N. (1986). Toward a broader conception of human intelligence. American Psychologist, 41, 445–452.

The author argues that current models of intelligence are limited because they do not simulate real-world problem situations, and he reviews a number of studies that do simulate real-world problems.

Kaufman, A. S. (1983). Some questions and answers about the Kaufman Assessment Battery for Children (K-ABC). Journal of Psychoeducational Assessment, 1, 205–218.

A highly readable overview of the K-ABC written by its senior author. In addition to a description of the battery, the author covers five basic questions: (1) Why was the age range of 2½ to 12½ years selected? (2) Does the Mental Processing Composite Scale predict future achievement? (3) Do the practice items lower reliability? (4) Why are there more simultaneous than sequential subtests? (5) Is the K-ABC a replacement for the WISC-R?

Keating, D. P. (1990). Charting pathways to the development of expertise. Educational Psychologist, 25, 243–267.

A very theoretical article that first briefly reviews the history of the conception of intelligence and then engages in some speculative thinking. The article introduces "Alfreda" Binet, the mythical twin sister of Alfred Binet, who might have done things quite differently from her famous brother.

Weinberg, R. A. (1989). Intelligence and IQ. American Psychologist, 44, 98–104.

A brief overview of the topic of intelligence, some of the controversies, and some of the measurement issues.

DISCUSSION QUESTIONS

1. Do you agree that "intelligent behavior can be observed"? What might be some of the aspects of such behavior?
2. Which of the six metaphors of intelligence makes most sense to you?
3. What are some of the reasons why intelligence tests are not good predictors of college GPA?
4. How is the validity of an intelligence test such as the Stanford-Binet IV established?
5. Discuss the validity of any intelligence test in the primary-secondary-tertiary framework we discussed in Chapter 3.
6 Attitudes, Values, and Interests

AIM This chapter looks at the measurement of attitudes, values, and interests. These
three areas share much in common from a psychometric as well as a theoretical point
of view; in fact, some psychologists argue that the three areas, and especially attitudes
and values, are not so different from each other. Some authors regard them as subsets
of personality, while others point out that it is difficult, if not impossible, to define
these three areas so that they are mutually exclusive.

The measurement of attitudes has been a central topic in social psychology, but has found relatively little application in the assessment of the individual client. Interest measurement, on the other hand, particularly the assessment of career interests, probably represents one of the most successful applications of psychological testing to the individual client. The assessment of values has had somewhat of a mixed success, with such assessment often seen as part of personality and/or social psychology, and with some individual practitioners believing that values are an important facet of a client's assessment.

In the area of attitudes we look at some general issues, some classical ways of developing attitude scales, and some other examples to illustrate various aspects. In the area of values, we look at two of the more popular measures that have been developed, the Study of Values and the Rokeach Value Survey. Finally, in the area of interest measurement, we focus on career interests and the two sets of tests that have dominated this field, the Strong and the Kuder.

ATTITUDES

Definition. Once again, we find that there are many ways of defining attitudes, and not all experts in this field agree as to what is and what is not an attitude. For our purposes, however, we can consider attitudes as a predisposition to respond to a social object, such as a person, group, idea, physical object, etc., in particular situations; the predisposition interacts with other variables to influence the actual behavior of a person (Cardno, 1955).

Most discussions and/or definitions of attitude involve a tripartite model of affect, behavior, and cognition. That is, attitudes considered as a response to an object have an emotional component (how strongly one feels), a behavioral component (for example, voting for a candidate; shouting racial slurs; arguing about one's views), and a cognitive (thinking) component (e.g., Insko & Schopler, 1967; Krech, Crutchfield, & Ballachey, 1962). These three components should converge (that is, be highly similar), but each should also contribute something unique, and that indeed seems to be the case (e.g., Breckler, 1984; Ostrom, 1969; Rosenberg, Hovland, McGuire, et al., 1960). This tripartite model is the "classical" model that has guided much research, but it too has been criticized and new theoretical models proposed (e.g., Cacioppo, Petty, & Geen, 1989; Pratkanis & Greenwald, 1989; Zanna & Rempel, 1988).

Some writers seem to emphasize one component more than the others. For example, Thurstone (1946) defined attitude as "the degree of positive or negative affect associated with some psychological object." But most social scientists do perceive attitudes as learned predispositions to respond to a specific target, in either a positive or negative manner. As in other areas of assessment, there are a number of theoretical models available (e.g., Ajzen & Fishbein, 1980; Bentler & Speckart, 1979; Dohmen, Doll, & Feger, 1989; Fishbein, 1980; Jaccard, 1981; Triandis, 1980; G. Wiechmann & L. A. Wiechmann, 1973).

Centrality of attitudes. The study of attitudes and attitude change has occupied a central position in the social sciences, and particularly in social psychology, for a long time. Even today, the topic is one of the most active topics of study (Eagly & Chaiken, 1992; Oskamp, 1991; Rajecki, 1990). Part of the reason why the study of attitudes has been so central lies in the assumption that attitudes will reveal behavior; because behavior seems so difficult to assess directly, attitudes are assumed to provide a way of understanding behavior (Kahle, 1984). Thus the relationship between attitudes and behavior is a major question, with some writers questioning such a relationship (e.g., Wicker, 1969) and others proposing that such a relationship is moderated by situational or personality factors (e.g., Ajzen & Fishbein, 1973; Zanna, Olson, & Fazio, 1980).

Some precautions. Henerson, Morris, and Fitz-Gibbon (1987) suggest that in the difficult task of measuring attitudes, we need to keep in mind four precautions:

1. Attitudes are inferred from a person's words and actions; thus, they are not measured directly.
2. Attitudes are complex; feelings, beliefs, and behaviors do not always match.
3. Attitudes may not necessarily be stable, and so the establishment of reliability, especially when viewed as consistency over time, can be problematic.
4. Often we study attitudes without necessarily having uniform agreement as to their nature.

Ways of studying attitudes. There are many ways in which attitudes can be measured or assessed. The first and most obvious way to learn what a person's attitude is toward a particular issue is to ask that person directly. Everyday conversations are filled with this type of assessment, as when we ask others such questions as "How do you feel about the death penalty?" "What do you think about abortion?" and "Where do you stand on gay rights?" This method of self-report is simple and direct, can be useful under some circumstances, but is quite limited from a psychometric point of view. There may be pressures to conform to majority opinion or to be less than candid about what one believes. There may be a confounding of expressed attitude with verbal skills, shyness, or other variables. A. L. Edwards (1957a) cites a study in which college students interviewed residents of Seattle about a pending legislative bill. Half of the residents were asked directly about their views, and half were given a secret and anonymous ballot to fill out. More "don't know" responses were obtained by direct asking, and more unfavorable responses were obtained through the secret ballot. The results of the secret ballot were also in greater agreement with actual election results held several weeks later.

There are other self-reports, and these can include surveys, interviews, or more "personal" procedures such as keeping a log or journal. Self-reports can ordinarily be used when the respondents are able to understand what is being asked, can provide the necessary information, and are likely to respond honestly.

Observing directly. Another approach to the study of attitudes is to observe a person's behavior, and to infer from that behavior the person's attitudes. Thus, we might observe shoppers in a grocery store to determine their attitudes toward a particular product. The problem, of course, is that a specific behavior may not be related to a particular attitude (for a brief, theoretical discussion of the relationship between attitudes and observable behavior see J. R. Eiser, 1987). You might buy chicken not because you love chicken but because you cannot afford filet mignon, or because you might want to try out a new recipe, or because your physician has suggested
less red meat. Such observer-reports can include a variety of procedures ranging from observational assessment to interviews, questionnaires, logs, etc. This approach is used when the people whose attitudes are being investigated may not be able to provide accurate information, or when the focus is directly on behavior that can be observed, or when there is evidence to suggest that an observer will be less biased and more objective.

Assessing directly. Because of the limitations inherent in both asking and observing, attitude scales have been developed as a third means of assessing attitudes. An attitude scale is essentially a collection of items, typically called statements, which elicit differential responses on the part of individuals who hold different attitudes. As with any other instrument, the attitude scale must be shown to have adequate reliability and validity. We will return to attitude scales below.

Sociometric procedures. Mention should be made here of sociometric procedures, which have been used to assess attitudes, not so much toward an external object, but more to assess the social patterns of a group. Thus, if we are interested in measuring the social climate of a classroom (which children play with which children; who are the leaders and the isolates, etc.), we might use a sociometric technique (for example, having each child identify their three best friends in that classroom). Such nominations may well reflect racial and other attitudes. Sociometric techniques can also be useful to obtain a base rate reading prior to the implementation of a program designed to change the group dynamics, or to determine whether a particular program has had an effect. There are a wide variety of sociometric measures, with two of the more popular consisting of peer ratings and social choices. In the peer rating method, the respondent reads a series of statements and indicates to whom the statement refers. For example:

this child is always happy.
this child has lots of friends.
this child is very good at playing sports.

In the social choice method, the respondent indicates the other persons whom he or she prefers. For example:

I would like to work with:
I would like to be on the same team as:

In general, it is recommended that sociometric items be positive rather than negative and general rather than specific (see Gronlund, 1959, for information on using and scoring sociometric instruments).

Records. Sometimes, written records that are kept for various purposes (e.g., school attendance records) can be analyzed to assess attitudes, such as attitudes toward school or a particular school subject.

Why use rating scales? Given so many ways of assessing attitudes, why should rating scales be used? There are at least six major reasons offered in the literature: (1) attitude rating scales can be administered to large groups of respondents at one sitting; (2) they can be administered under conditions of anonymity; (3) they allow the respondent to proceed at their own pace; (4) they present uniformity of procedure; (5) they allow for greater flexibility – for example, take-home questionnaires; and (6) the results are more amenable to statistical analyses.

At the same time, it should be recognized that their strengths are also their potential weaknesses. Their use with large groups can preclude obtaining individualized information or results that may suggest new avenues of questioning.

Ways of Measuring Attitudes

The method of equal-appearing intervals. This method, also known as the Thurstone method after its originator (Thurstone & Chave, 1929), is one of the most common methods of developing attitude scales and involves the following steps:

1. The first step is to select the social object or target to be evaluated. This might be an individual (the President), a group of people (artists), an idea or issue (physician-assisted suicide), a physical object (the new library building), or other targets.

2. Next a pool of items (close to 100 is not uncommon) is generated – designed to represent both favorable and unfavorable views. An assumption of most attitude research is that
attitudes reflect a bipolar continuum ranging from pro to con, from positive to negative.

3. The items are printed individually on cards, and these cards are then given to a group of "expert" subjects (judges) who individually sort the items into 11 piles according to the degree of favorableness (not according to whether they endorse the statement). Ordinarily, items placed in the first pile are the most unfavorable, items in the 6th pile are neutral, and items in the 11th pile are the most favorable. Note that this is very much like doing a Q sort, but the individual judge can place as many items in any one pile as he or she wishes. The judges are usually chosen because they are experts on the target being assessed – for example, statements for a religion attitude scale might be sorted by ministers.

4. The median value for each item is then computed by using the pile number. Thus if item #73 is placed by five judges in piles 6, 6, 7, 8, and 9, the median for that item would be 7. Ordinarily, of course, we would be using a sizable sample of judges (closer to 100 is not uncommon), and so the median values would most likely be decimal numbers.

5. The median is a measure of central tendency – of average. We also need to compute for each item the amount of variability or of dispersion among scores, the scores again being the pile numbers. Ordinarily, we might think of computing the standard deviation, but Thurstone computed the interquartile range, known as Q. The interquartile range for an item is based on the difference between the pile values of the 25th and the 75th percentiles. This measure of dispersion in effect looks at the variability of the middle 50% of the values assigned by the judges to a particular item. A small Q value would indicate that most judges agreed in their placement of a statement, while a larger value would indicate greater disagreement. Often disagreement reflects a poorly written item that can be interpreted in various ways.

6. Items are then retained that (1) have a wide range of medians so that the entire continuum is represented and (2) that have the smallest Q values.

The resulting scale is then administered by asking each respondent to check those items the respondent agrees with. The items are printed in random order. A person's score on the attitude scale is the median of the scale values of all the items endorsed.

For example, let's assume we have developed a scale to measure attitudes toward the topic of "psychological testing." Here are six representative items with their medians and Q values:

    Item                                               Median  Q value
     1. I would rather read about psychological
        testing than anything else                       10.5     .68
    14. This topic makes you really appreciate
        the complexity of the human mind                  8.3    3.19
    19. This is a highly interesting topic                6.7     .88
    23. Psychological testing is OK                       4.8     .52
    46. This topic is very boring                         2.1     .86
    83. This is the worst topic in psychology             1.3     .68

Note that item 14 would probably be eliminated because of its larger Q value. If the other items were retained and administered to a subject who endorses items 1, 19, and 23, then that person's score would be the median of 10.5, 6.7, and 4.8, which would be 6.7.
tile range for an item is based on the difference The intent of this method was to develop an
between the pile values of the 25th and the 75th interval scale, or possibly a ratio scale, but it is
percentiles. This measure of dispersion in effect clear that the zero point (in this case the center
looks at the variability of the middle 50% of the of the distribution of items) is not a true zero.
values assigned by the judges to a particular item. The title “method of equal-appearing intervals”
A small Q value would indicate that most judges suggests that the procedure results in an interval
agreed in their placement of a statement, while a scale, but whether this is so has been questioned
larger value would indicate greater disagreement. (e.g., Hevner, 1930; Petrie, 1969). Unidimension-
Often disagreement reflects a poorly written item ality, hopefully, results from the writing of the ini-
that can be interpreted in various ways. tial pool of items, in that all of the items should
6. Items are then retained that (1) have a wide be relevant to the target being assessed and from
range of medians so that the entire continuum is selecting items with small Q values.
represented and (2) that have the smallest Q val- There are a number of interesting questions
ues indicating placement agreement on the part that can be asked about the Thurstone procedure.
of the judges. For example, why use 11 categories? Why use the
7. The above steps will yield a scale of maybe 15 median rather than the mean? Could the judges
to 20 items that can then be administered to a rate each item rather than sort the items? In gen-
sample of subjects with the instructions to check eral, variations from the procedures originally
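To make the arithmetic of steps 4 through 7 concrete, here is a minimal Python sketch of the computations. The item statements and judge pile placements are hypothetical, and statistics.quantiles interpolates percentiles in a way that may differ slightly from Thurstone's hand-computation conventions.

```python
import statistics

# Hypothetical pile placements (1 = most unfavorable ... 11 = most
# favorable) assigned by a handful of judges to three draft items.
judge_piles = {
    "I would rather read about this topic than anything else":
        [11, 10, 11, 10, 11, 10, 11],
    "This topic makes you appreciate the complexity of the mind":
        [3, 6, 9, 11, 8, 10, 5],   # judges disagree -> large Q
    "This topic is OK":
        [5, 5, 4, 5, 6, 5, 4],
}

def scale_value(piles):
    """Step 4: an item's scale value is the median pile number."""
    return statistics.median(piles)

def q_value(piles):
    """Step 5: Q, the interquartile range of the pile numbers."""
    q1, _, q3 = statistics.quantiles(piles, n=4)
    return q3 - q1

for item, piles in judge_piles.items():
    print(f"median = {scale_value(piles):4.1f}  Q = {q_value(piles):4.2f}  {item}")

# Step 7: a respondent's score is the median scale value of the
# items he or she endorses.
def respondent_score(endorsed_scale_values):
    return statistics.median(endorsed_scale_values)

print(respondent_score([10.5, 6.7, 4.8]))  # the example above -> 6.7
```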
There are a number of interesting questions that can be asked about the Thurstone procedure. For example, why use 11 categories? Why use the median rather than the mean? Could the judges rate each item rather than sort the items? In general, variations from the procedures originally used by Thurstone do not seem to make much difference (S. C. Webb, 1955).

One major concern is whether the attitudes of the judges who do the initial sorting influence how the items are sorted. At least some studies have suggested that the attitudes of the judges, even if extreme, can be held in abeyance with careful instructions, and do not influence the sorting of the items in a favorable-unfavorable continuum (e.g., Bruvold, 1975; Hinckley, 1932).

Another criticism made of Thurstone scales is that the same total score can be obtained by endorsing totally different items; one person may obtain a total score by endorsing one very favorable item or 9 or 10 unfavorable items that would add to the same total. This criticism is, of course, not unique to the Thurstone method. Note that when we construct a scale we ordinarily assume that there is a continuum we are assessing (intelligence, anxiety, psychopathology, liberal-conservative, etc.) and that we can locate the position of different individuals on this continuum as reflected by their test scores. We ordinarily don't care how those scores are composed – on a 100-item classroom test, it doesn't ordinarily matter which 10 items you miss, your raw score will still be 90. But one can argue that it ought to matter. Whether you miss the 10 most difficult items or the 10 easiest items probably says something about your level of knowledge or test-taking abilities, and whether you miss 10 items all on one topic vs. 10 items on 10 different topics might well be related to your breadth of knowledge.

Example of a Thurstone scale. J. H. Wright and Hicks (1966) attempted to develop a liberalism-conservatism scale using the Thurstone method. This dimension is a rather popular one, and several such scales exist (e.g., G. Hartmann, 1938; Hetzler, 1954; Kerr, 1952; G. D. Wilson & Patterson, 1968). The authors assembled 358 statements that were sorted into an 11-point continuum by 45 college students in an experimental psychology class (could these be considered experts?). From the pool of items, 23 were selected to represent the entire continuum and with the smallest SD (note that the original Thurstone method called for computing the interquartile range rather than the SD – but both are measures of variability). To validate the scale, it was administered to college students, members of Young Democrat and Young Republican organizations, with Democrats assumed to represent the liberal point of view and Republicans the conservative.

Below are representative items from the scale with the corresponding scale values:

 1. All old people should be taken care of by the government.             2.30
10. Labor unions play an essential role in American democracy.            4.84
16. The federal government should attempt to cut its annual spending.     7.45
23. Isolation (complete) is the answer to our foreign policy.            10.50

Note that the dimension on which the items were sorted was liberal vs. conservative, rather than pro or con.

The authors report a corrected internal consistency coefficient (split-half) of +.79, and a Guttman reproducibility score of .87 (see the following discussion). The correlation between political affiliation and scale score was +.64, with Young Democrats having a mean score of 4.81 and Young Republicans a mean score of 5.93. These two means are not all that different, and one may question the initial assumption of the authors that Democrats equal liberal and Republicans equal conservative, and/or whether the scale really is valid. Note also that the authors chose contrasted groups, a legitimate procedure, but one may well wonder whether the scale would differentiate college students with different political persuasions who have chosen not to join campus political organizations. Finally, many of the items on the scale have become outmoded. Perhaps more than other measures, attitude scales have a short "shelf life," and rapidly become outdated in content, making longitudinal comparisons somewhat difficult.

The method of summated ratings. This method, also known as the Likert method after its originator (Likert, 1932), uses the following sequence of steps:

1. and 2. These are the same as in the Thurstone method, namely choosing a target concept and generating a pool of items.
3. The items are administered to a sample of subjects who indicate for each item whether they "strongly agree," "agree," "are undecided," "disagree," or "strongly disagree" (sometimes a word like "approve" is used instead of agree). Note that these subjects are not experts as in the Thurstone method; they are typically selected because they are available (introductory psychology students), or they represent the population that eventually will be assessed (e.g., registered Democrats).

4. A total score for each subject can be generated by assigning scores of 5, 4, 3, 2, and 1 to the above categories, and reversing the scoring for unfavorably worded items; the intent here is to be consistent, so that ordinarily higher scores represent a more favorable attitude.

5. An item analysis is then carried out by computing for each item a correlation between responses on that item and total scores on all the items (to be statistically correct, the total score should be for all the other items, so that the same item is not correlated with itself, but given a large number of items such overlap has minimal impact).

6. Individual items that correlate the highest with the total score are then retained for the final version of the scale. Note therefore that items could be retained that are heterogeneous in content, but correlate significantly with the total. Conversely, we could also carry out an item analysis using the method of item discrimination we discussed. Here we could identify the top 27% high scorers and the bottom 27% low scorers, and analyze for each item how these two groups responded to that item. Those items that show good discrimination between high and low scorers would be retained.

7. The final scale can then be administered to samples of subjects and their scores computed. Such scores will be highly relative in meaning – what is favorable or unfavorable depends upon the underlying distribution of scores.

Note should be made that some scales are called Likert scales simply because they use a 5-point response format, but may have been developed without using the Likert procedure, i.e., simply by the author putting together a set of items.

Are five response categories the best? To some degree psychological testing is affected by inertia and tradition. If the first or major researcher in one area uses a particular type of scale, quite often subsequent investigators also use the same type of scale, even when designing a new scale. But the issue of how many response categories are best – "best" judged by "user-friendly" aspects and by reliability and validity – has been investigated with mixed results (e.g., Komorita & Graham, 1965; Masters, 1974; Remmers & Ewart, 1941). Probably a safe conclusion here is that there does not seem to be an optimal number, but that five to seven categories seem to be better than fewer or more.

In terms of our fourfold classification of nominal, ordinal, interval, and ratio scales, Likert scales fall somewhere between ordinal and interval. On the one hand, by adding the arbitrary scores associated with each response option, we are acting as if the scale is an interval scale. But clearly the scores are arbitrary – why should the difference between "agree" and "strongly agree" be of the same numerical magnitude as the difference between "uncertain" and "agree"? And why should a response of "uncertain" be assigned a value of 3?

The above two methods are the most common ways of constructing attitude scales. Both are based upon what are called psychophysical methods, ways of assessing stimuli on the basis of their physical dimensions such as weight, but as determined psychologically (How heavy does this object feel?). Interested readers should see A. L. Edwards (1957a) for a discussion of these methods as related to attitude scale construction.

How do the Thurstone and Likert procedures compare? For example, would a Thurstone scale of attitudes toward physician-assisted suicide correlate with a Likert scale of the same target? Or what if we used the same pool of items and scored them first using the Thurstone method and then the Likert method – would the resulting sets of scores be highly related? In general, studies indicate that such scales typically correlate to a fair degree (in the range of .60 to .95). Likert scales typically show higher split-half or test-retest reliability than Thurstone scales. Likert scales are also easier to construct and use, which is why there are more of them available (see Roberts, Laughlin, & Wedell, 1999 for more complex aspects of this issue). We now turn to a number of other methods which, though important, have proven less common.
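A minimal Python sketch of the Likert scoring and item analysis described in steps 4 through 6; the response matrix is hypothetical, and the item-total correlation is "corrected" by excluding each item from the total, as the text recommends.

```python
# Rows are respondents, columns are items; entries are 1-5
# ("strongly disagree" ... "strongly agree").

def reverse_key(response, high=5, low=1):
    """Reverse-score an unfavorably worded item (5 becomes 1, etc.)."""
    return high + low - response

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

responses = [
    [5, 4, 2, 5], [4, 4, 1, 4], [2, 1, 4, 2],
    [3, 2, 3, 3], [5, 5, 1, 5], [1, 2, 5, 1],
]
reversed_items = {2}  # item index 2 is worded unfavorably

keyed = [[reverse_key(r) if j in reversed_items else r
          for j, r in enumerate(row)] for row in responses]

# Corrected item-total correlation: each item against the sum of the
# *other* items, so the item is not correlated with itself.
n_items = len(keyed[0])
for j in range(n_items):
    item = [row[j] for row in keyed]
    rest = [sum(row) - row[j] for row in keyed]
    print(f"item {j + 1}: corrected item-total r = {pearson_r(item, rest):.2f}")
```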
The Bogardus (1925) method. This method was developed in an attempt to measure attitudes toward different nationalities. Bogardus simply asked subjects to indicate whether they would admit members of a particular nationality or race to different degrees of social contact as defined by these seven categories:

1. close kinship by marriage
2. membership in one's club (or as close friends)
3. live on the same street as neighbor
4. employment in the same occupation (or work in same office)
5. citizenship in this country
6. visitor in this country
7. would exclude from this country

The scale forms a continuum of social distance, where at one end a person is willing to accept the target person in a very intimate relationship and at the other extreme would keep the target person as far away as possible. The instructions ask the subject to check those alternatives that reflect his or her reaction and not to react to the best or the worst members of the group that the respondent might have known. The score is simply the rank of the lowest (most intimate) item checked. If the group being assessed is a racial group, such as Blacks, then the resulting score is typically called a racial distance quotient. Note that multiple ratings could be obtained by having a bivariate table, with one dimension representing racial groups and the other dimension representing the seven categories. Figure 6.1 illustrates this.

FIGURE 6–1. Example of a Bogardus Scale using multiple targets. Each group (Hispanics, American Indians, Blacks, Italians, Russians, etc.) is rated from 1 to 7, with each rating defined by one of the seven social-distance statements.

The Bogardus approach is a methodology, but also a unique scale, as opposed to the Thurstone and Likert methods, which have yielded a wide variety of scales. Therefore, it is appropriate here to mention reliability and validity. Newcomb (1950) indicated that split-half reliability of the Bogardus scale typically reaches .90 or higher and that the validity is satisfactory. There have been a number of versions of the Bogardus scale; for example, Dodd (1935) developed an equal-interval version of this scale for use in the Far East, while Miller and Biggs (1958) developed a modified version for use with children. In general, however, the Bogardus social distance approach has had limited impact, and its use nowadays seems to be rare.
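Scoring a Bogardus protocol is simple enough to state in a few lines. The following sketch assumes the seven categories above, with the respondent's checked alternatives given as a set of category numbers.

```python
# Category 1 is the most intimate contact; the social-distance score is
# the rank of the most intimate (lowest-numbered) alternative checked.

CATEGORIES = {
    1: "close kinship by marriage",
    2: "membership in one's club (or as close friends)",
    3: "live on the same street as neighbor",
    4: "employment in the same occupation",
    5: "citizenship in this country",
    6: "visitor in this country",
    7: "would exclude from this country",
}

def social_distance_score(checked):
    """Return the rank of the most intimate category checked."""
    return min(checked)

# A respondent willing to accept the target group as coworkers,
# citizens, or visitors, but nothing more intimate:
print(social_distance_score({4, 5, 6}))  # -> 4
```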
Guttman scaling. This method is also known as scalogram analysis (Guttman, 1944). There is little difficulty in understanding the Bogardus social distance scale, and we can think of the Guttman method as an extension. We can easily visualize how close or far away a particular person might wish to keep from members of a racial group, even though we may not understand and/or condone racial prejudice. Ordinarily, we would expect that if a person welcomes a member of a different race into their own family, they would typically allow that person to work in the same office, and so on. The social distance scale is a univariate scale, almost by definition, where a person's position on that scale can be defined simply by the point where the person switches response mode. Suppose, for example, I have a mild case of racial bias against Venusian Pincos; I would allow them in this country as visitors or citizens, and would not really object to working with them, but I certainly would not want them as neighbors, or close friends, and would simply die if my daughter married one of them. My point of change is from item 4 to item 3; knowing that point of change, you could reproduce all my seven responses, assuming I did not reverse myself. This is in fact what Guttman scaling is all about. In developing a Guttman scale, a set of items that form a scalable continuum (such as social distance) is administered to a group of subjects, and the pattern of responses is analyzed to see if they fit the Guttman model. As an example, let's assume we have only three items: A (on marriage), B (on close friends), and C (on neighbor), each item requiring agreement or disagreement. Note that with the three items, we could theoretically obtain the following patterns of response:

Item A (marriage)   Item B (close friends)   Item C (neighbor)
Agree               Disagree                 Disagree
Agree               Agree                    Disagree
Agree               Agree                    Agree       *
Disagree            Agree                    Agree       *
Disagree            Disagree                 Agree       *
Disagree            Disagree                 Disagree    *
Agree               Disagree                 Agree
Disagree            Agree                    Disagree

In fact, the number of possible response patterns is 2^N, where N is the number of items; in this case 2^3 equals 2 × 2 × 2 or 8. If however, the items form a Guttman scale, there should be few if any reversals, and only the four response patterns marked by an * should occur. The ideal number of response patterns then becomes N + 1, or 4 in this example. We can then compute what is called the coefficient of reproducibility, which is defined as:

    1 − (total number of errors / total number of responses)

where errors are any deviation from the "ideal" pattern. If the reproducibility coefficient is .90 or above, then the scale is considered satisfactory. Although the matter seems fairly straightforward, there are a number of complicating issues that are beyond the scope of this book (e.g., A. L. Edwards, 1957a; Festinger, 1947; Green, 1954; Schuessler, 1961).
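The following sketch computes the coefficient of reproducibility for the three-item example. The respondents' patterns are hypothetical, and errors are counted here as the minimum number of responses that would have to change to produce an ideal pattern; published procedures count errors in slightly different ways.

```python
# Items (A, B, C) = (marriage, close friends, neighbor); 1 = Agree,
# 0 = Disagree. The four starred cumulative patterns are "ideal".
IDEAL = [(1, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 0)]

def errors(pattern):
    """Fewest response changes needed to reach some ideal pattern."""
    return min(sum(p != i for p, i in zip(pattern, ideal))
               for ideal in IDEAL)

def reproducibility(patterns):
    total_errors = sum(errors(p) for p in patterns)
    total_responses = len(patterns) * len(patterns[0])
    return 1 - total_errors / total_responses

# Ten hypothetical respondents; two patterns deviate from the model.
sample = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1), (0, 0, 1),
          (0, 1, 1), (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1)]
print(f"reproducibility = {reproducibility(sample):.2f}")  # -> 0.93
```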
Guttman scales are not restricted to social distance, but could theoretically be developed to assess any variable. Let's assume I am working with an elderly population, perhaps female clients living in a nursing home, and I wish to assess their degree of independence as far as food preparation is concerned. I might develop a Guttman scale that might look like this:

This client is able to:
(a) plan and prepare a meal on her own
(b) plan and prepare a meal with some assistance
(c) prepare a meal but must be given the ingredients
(d) prepare a meal but needs assistance
(e) cannot prepare a meal on her own

We can think of reproducibility as reflecting unidimensionality, and Guttman scales are thus unidimensional scales. Note however, that the method does not address the issue of equal intervals or the arbitrariness of the zero point; thus Guttman scales, despite their methodological sophistication, are not necessarily interval or ratio scales. The Guttman methodology has had more of an impact in terms of thinking about scale construction than in terms of actual, useful scales. Such scales do of course exist, but the majority assess variables that are behavioral in nature (such as the range of movement or physical skills a person possesses), rather than variables that are more "psychodynamic." There are a number of other procedures used to develop attitude scales, which, like the Guttman approach, are fairly complex both in theory and in statistical procedures (e.g., Banta, 1961; Coombs, 1950; Green, 1954; Hays & Borgatta, 1954; Lazarsfeld, 1950, 1954, 1959). In fact, there seems to be agreement that attitudes are multidimensional and that what is needed are more sophisticated techniques than the simple unidimensional approaches of Thurstone and Likert.
The Semantic Differential (SemD). The SemD was developed as a way of assessing word meaning but because this technique has been used quite frequently in the assessment of attitudes it can legitimately be considered here. The SemD is a method of observing and measuring the psychological meaning of things, usually concepts. We can communicate with one another because words and concepts have a shared meaning. If I say to you, "I have a dog," you know what a dog is. Yet that very word also has additional meanings that vary from person to person. One individual may think of dog as warm, cuddly, and friendly while another person may think of dog as smelly, fierce, and troublesome. There are thus at least two levels of meaning to words: the denotative or dictionary meaning, and the connotative or personal meaning. Osgood (Osgood, Suci, & Tannenbaum, 1957) developed the SemD to measure the connotative meanings of concepts as points in a semantic space. That space is three-dimensional, like a room in a house, and the dimensions, identified through factor analysis, are evaluative (e.g., good-bad), potency (e.g., strong-weak), and activity (fast-slow). Four additional factorial dimensions have been identified: density (e.g., numerous-sparse), orderliness (e.g., haphazard-systematic), reality (e.g., authentic-fake), and familiarity (e.g., commonplace-exceptional) (Bentler & LaVoie, 1972; LaVoie & Bentler, 1974).

The SemD then consists of a series of bipolar adjectives separated by a 7-point scale, on which the respondent rates a given concept. Figure 6.2 gives an example of a SemD.

FIGURE 6–2. Example of a Semantic Differential Scale. The concept to be rated (here, "My ideal self") appears at the top, followed by 7-point scales anchored by bipolar adjectives such as good-bad, small-large, beautiful-ugly, passive-active, sharp-dull, slow-fast, and dirty-clean.

How does one develop a SemD scale? There are basically two steps. The first step is to choose the concept(s) to be rated. These might be famous persons (e.g., Mother Theresa, Elton John), political concepts (socialism), psychiatric concepts (alcoholism), therapeutic concepts (my ideal self), cultural groups (Armenians), nonsense syllables, drawings, photographs, or whatever other stimuli would be appropriate to the area of investigation.

The second step is to select the bipolar adjectives that make up the SemD. We want the scale to be short, typically around 12 to 16 sets of bipolar adjectives, especially if we are asking each respondent to rate several concepts (e.g., rate the following cities: New York, Rome, Paris, Istanbul, Cairo, and Caracas). Which adjectives would we use? Bipolar adjectives are selected on the basis of two criteria: factor representativeness and relevance. Typical studies of the SemD have obtained the three factors indicated above, so we would select four or five bipolar adjectives representative of each factor; the loadings of each adjective pair on the various factor dimensions are given in various sources (e.g., Osgood, Suci, & Tannenbaum, 1957; Snider & Osgood, 1969). The second criterion of relevance is a bit more difficult to implement. If the concept of Teacher were being rated, one might wish to use bipolar pairs that are relevant to teaching behavior such as organized vs. disorganized, or concerned
about students vs. not concerned (note that the "bipolar adjectives" need not be confined to one word). However, other bipolar pairs that on the surface may not seem highly relevant, such as heavy-light, ugly-beautiful, might in fact turn out to be quite relevant, in distinguishing between students who drop out vs. those who remain in school, for example.

In making up the SemD scale, about half of the bipolar adjectives would be listed in reverse order (as we did in Figure 6.2) to counteract response bias tendencies, so that not all left-hand terms would be positive. A 7-point scale is typically used, although between 3 and 11 spaces have been used in the literature; with children, a 5-point scale seems more appropriate.

Scoring the SemD. The SemD yields a surprising amount of data and a number of analyses are possible. The raw scores are simply the numbers 1 through 7 assigned as follows:

Good 7 : 6 : 5 : 4 : 3 : 2 : 1 Bad

The numbers do not appear on the respondent's protocol. Other numbers could be used, for example +3 to −3, but little if anything is gained and the arithmetic becomes more difficult.

If we are dealing with a single respondent, we can compare the semantic space directly. For example, Osgood and Luria (1954) analyzed a case of multiple personality (the famous "3 faces of Eve"), clearly showing that each personality perceived the world in rather drastically different terms, as evidenced by the ratings of such concepts as father, therapist, and myself.

Research projects and the assessment of attitudes usually involve a larger number of respondents, and various statistical analyses can be applied to the resulting data. Let's assume, for example, we are studying attitudes toward various brands of beer. Table 6.1 shows the results from one subject who was asked to rate each of five brands:

Table 6–1. SemD ratings from one subject for five brands of beer

SemD Scales           Brand A   Brand B   Brand C   Brand D   Brand E
Pleasant-unpleasant      6         2         6         5         3
Ugly-beautiful           5         2         5         5         2
Sharp-flat               6         1         4         6         2
Salty-sweet              7         1         5         6         3
Happy-sad                5         3         5         7         1
Expensive-cheap          6         2         7         7         2
Mean                    5.83      1.83      5.33      6.00      2.17

For the sake of simplicity, let's assume that the six bipolar pairs are all evaluative items. A first step would be to compute and compare the means. Clearly brands A, C, and D are evaluated quite positively, while brands B and E are not. If the means were group averages, we could test for statistical significance perhaps using an ANOVA design. Note that in the SemD there are three sources of variation in the raw scores: differences between concepts, differences between scales (i.e., items), and differences between respondents. In addition we typically have three factors to contend with.

Distance-cluster analysis. If two brands of beer are close together in semantic space, that is, rated equivalently, they are alike in "meaning" (e.g., brands C and D in Table 6.1). If they are separated in semantic space they differ in meaning (e.g., brands D and E). What is needed is a measure of the distance between any two concepts. Correlation comes to mind, but for a variety of reasons, it is not suitable. What is used is the D statistic:

    Dij = √(Σ dij²)

that is, the distance between any two concepts i and j equals the square root of the sum of the differences squared, where each difference dij is taken between the two concepts' ratings on a given scale. For example, the distance between brand A and brand B in the above example equals:

    (6 − 2)² + (5 − 2)² + (6 − 1)² + (7 − 1)² + (5 − 3)² + (6 − 2)² = 106

and D = √106 or 10.3.
We can do the same for every pair of concepts. If we have n concepts (5 in our example), we will compute n(n − 1)/2 D values. These D values can be written down in a matrix:

            Brand B      C        D        E
Brand A      10.30     3.00     2.65     9.06
      B                8.89    10.44     3.16
      C                         3.16     8.19
      D                                  9.95

Such a D matrix can be analyzed in several ways but the aim is the same: to seek how the concepts cluster together. The smaller the D value the closer in meaning are the concepts. Visually we can see that our five brands fall into two clusters: brands A, C, and D vs. brands B and E. Statistically we can use a variety of techniques including correlation and factor analysis (Osgood, Suci, & Tannenbaum, 1957) or more specific techniques (McQuitty, 1957; Nunnally, 1962).
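Both the brand means and the full D matrix can be reproduced directly from the ratings in Table 6.1; a minimal Python sketch:

```python
from itertools import combinations

# Ratings from Table 6.1: one subject, six evaluative scales per brand.
ratings = {
    "A": [6, 5, 6, 7, 5, 6],
    "B": [2, 2, 1, 1, 3, 2],
    "C": [6, 5, 4, 5, 5, 7],
    "D": [5, 5, 6, 6, 7, 7],
    "E": [3, 2, 2, 3, 1, 2],
}

def distance(p, q):
    """D: the square root of the sum of squared rating differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# The mean evaluation of each brand (the bottom row of Table 6.1) ...
for brand, vals in ratings.items():
    print(f"mean({brand}) = {sum(vals) / len(vals):.2f}")

# ... and the n(n - 1)/2 = 10 pairwise D values of the matrix above.
for i, j in combinations(sorted(ratings), 2):
    print(f"D({i},{j}) = {distance(ratings[i], ratings[j]):5.2f}")
```

The smallest distances (A-C, A-D, C-D, and B-E) recover the two clusters described in the text.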
Although three major factors are obtained in the typical study with the SemD, it is highly recommended that an investigator using the SemD check the resulting factor structure because there may be concept-scale interactions that affect such structure (Piotrowski, 1983; Sherry & Piotrowski, 1986). The evaluative factor seems to be quite consistent across samples, but the other two dimensions, potency and activity, are less consistent.

The SemD has found wide use in psychology, with both adults and children; DiVesta (1965) for example, provides a number of bipolar adjectives that can be used with children. An example of a SemD scale can be found in the study of Poresky, Hendrix, Mosier, et al. (1988) who developed the Companion Animal Semantic Differential to assess a respondent's perception of a childhood companion animal such as a pet dog. They used 18 bipolar sets of adjectives (bad-good, clean-dirty, cuddly-not cuddly) and obtained 164 responses from high-school, college, and graduate students. They used a 6-point scale to score each item, rather than the more standard 7-point. For the entire scale, the Cronbach alpha was .90 indicating substantial reliability. A factor analysis indicated four factors: (1) an evaluative factor (represented by such items as loving-not loving); (2) a factor related to the monetary value of the animal (e.g., valuable-worthless); (3) a factor related to affective value (kind-cruel); and (4) a factor related to the "size" of the animal (cuddly-not cuddly). When only the items that had substantial loadings were kept, the 18-item scale became a 9-item scale, and the four factors collapsed into one, namely an evaluative factor. Scores on the 9-item scale correlated .96 with scores on the 18-item scale. In case you're wondering of what use might such a scale be, you should know that there is a considerable body of literature and interest on the therapeutic effects of pet ownership on the elderly, the handicapped, coronary-care patients, and others.

One of the major concerns about the SemD is whether in fact the bipolar adjectives are bipolar – are the terms that anchor each scale truly opposite in meaning and equidistant from a true psychological midpoint? Results suggest that for some adjective pairs the assumption of bipolarity is not met (e.g., R. F. Green & Goldfried, 1965; Mann, Phillips, & Thompson, 1979; Schriesheim & Klich, 1991).

Checklists. One way to assess attitudes, particularly toward a large number of issues, is the checklist approach. As its name implies, this approach consists of a list of items (people, objects, issues, etc.) to which the respondent is asked to indicate their attitude in some way – by checking those items they endorse, selecting "favorable" or "unfavorable" for each item, indicating approval-disapproval, etc.

This is a simple and direct approach, and because all subjects are asked to respond to the same items, there is comparability of measurement. On the other hand, some argue that the presentation of a number of items can result in careless responding and hence lowered reliability and validity. In addition, the response categories typically used do not allow for degree of preference. (I may favor the death penalty and check that item in the list, but my convictions may not be very strong and might be easily dissuaded.)

An example of the checklist approach in the assessment of attitudes can be found in the work of G. D. Wilson and Patterson (1968) who developed the conservatism or C scale.
The C Scale

The liberal-conservative dimension has been studied quite extensively, both as it relates to political issues and voting behavior and as a personality syndrome. Many investigators use terms like authoritarianism, dogmatism, or rigidity to refer to this dimension. Perhaps the major scale in this area has been the F (fascist) scale developed in a study called The Authoritarian Personality (Adorno et al., 1950). The F scale was for a time widely used, but also severely criticized for being open to acquiescence response set, poor phrasing, and other criticisms. Numerous attempts have been made, not only to develop revised F scales but also new scales based on the approach used with the F scale, as well as entirely different methodologies, such as that used in the C scale.

G. D. Wilson and Patterson (1968) decided that they would use a list of brief labels or "catchphrases" to measure "conservatism," defined as "resistance to change" and a preference for "safe, traditional, and conventional" behavior (G. D. Wilson, 1973). Theoretically, G. D. Wilson and Patterson (1968) identified conservatism as characterized by seven aspects that included religious fundamentalism, intolerance of minority groups, and insistence on strict rules and punishments. On the basis of these theoretical notions, they assembled a pool of 130 items chosen intuitively as reflective of these characteristics. They performed three item analyses (no details are given) and chose 50 items for the final scale. The respondent is asked which items "do you favor or believe in" and the response options are "yes, ?, no." For half of the items, a "yes" response indicates conservatism, and for half of the items a "no" response indicates conservatism. Examples of items (with their conservative response) are: the "death penalty (y)," "modern art (n)," "suicide (n)," "teenage drivers (n)," and "learning Latin (y)."

G. D. Wilson and Patterson (1968) reported a corrected split-half correlation coefficient of .94 based on 244 New Zealand subjects. They also present considerable validity data including age trends (older persons score higher), gender differences (females score slightly higher), differences between collegiate political groups, and between scientists and a conservative religious group.

In a subsequent study, Hartley and Holt (1971) used only the first half of the scale, but found additional validity evidence in various British groups; for example, psychology undergraduate students scored lowest, while male "headmasters" scored higher (female college of education students scored highest of all!). On the other hand, J. J. Ray (1971) administered the scale to Australian military recruits (all 20-year-old males) and found an alpha coefficient of +.63 and a preponderance of "yes" responses. He concluded that this scale was not suitable for random samples from the general population.

Bagley, Wilson, and Boshier (1970) translated the scale into Dutch and compared the responses of Dutch, British, and New Zealander subjects. A factor analysis indicated that for each of the three samples there was a "strong" general factor (however, it only accounted for 18.7% of the variance, or less), and the authors concluded that not only was there a "remarkable degree of cross-cultural stability" for the scale, but that the C scale had "considerable potential as an international test of social attitudes." The C scale was originally developed in New Zealand, and is relatively well known in English-speaking countries such as Australia, England, and New Zealand, but has found little utility in the United States. In part, this may be due to language differences (as Professor Higgins of My Fair Lady sings: English has not been spoken in the United States for quite some time!). For example, one C scale item is "birching," which means "paddling" as in corporal punishment administered by a teacher. In fact, a few investigators (e.g., Bahr & Chadwick, 1974; Joe, 1974; Joe & Kostyla, 1975) have adapted the C scale for American samples by making such item changes.

Although the reliability of the C scale would seem adequate (in the Dutch sample, the split-half was .89), Altemeyer (1981) brings up an interesting point. He argues that coefficient alpha, which you recall is one measure of reliability, reflects both the interitem correlations and the length of the test. Thus, one could have a
questionnaire with a high coefficient alpha, but that might simply indicate that the questionnaire is long and not necessarily that the questionnaire is unidimensional. In fact, Altemeyer (1981) indicates that the average reliability coefficient for the C scale is .88, which indicates a mean interitem correlation of about .13 – thus, the C scale is criticized for not being unidimensional (see also Robertson & Cochrane, 1973).
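Altemeyer's point follows from the Spearman-Brown relation between standardized coefficient alpha, test length k, and the mean interitem correlation r: alpha = kr / (1 + (k − 1)r). A short sketch, with the 50-item C scale plugged in:

```python
# Standardized alpha as a function of test length k and mean interitem
# correlation r (the Spearman-Brown relation), and its inverse.
def alpha(k, r):
    return k * r / (1 + (k - 1) * r)

def mean_interitem_r(k, a):
    return a / (k - a * (k - 1))

k = 50  # the C scale has 50 items
print(round(mean_interitem_r(k, 0.88), 2))  # -> 0.13, as Altemeyer notes
print(round(alpha(k, 0.13), 2))             # -> 0.88 despite the modest r
```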
Some general comments on rating scales. Like checklists, rating scales are used for a wide variety of assessment purposes, and the comments here, although they focus on attitude measurement, are meant to generalize to other areas of testing. Traditionally, rating scales were used to have one person assess another, for example, when a clinical psychologist might assess a client as to degree of depression, but rating scales quickly were applied as self-report measures.

One common type of rating scale is the numerical scale, where the choices offered to the respondent either explicitly or implicitly are defined numerically. For example, to the statement, "Suicide goes against the natural law," we might ask the respondent to indicate whether they (a) strongly agree, (b) agree, (c) are not sure, (d) disagree, (e) strongly disagree. We may omit the numbers from the actual form seen by the respondent, but we would assign those numbers in scoring the response. Sometimes, the numbers are both positive and negative, as in:

strongly agree    agree    not sure    disagree    strongly disagree
      +2           +1          0          –1              –2

In general, such use of numbers makes life more complicated for both the respondent and the examiner. Mention should be made here that there seems to be a general tendency on the part of some respondents to avoid extreme categories. Thus the 5-point scale illustrated above may turn out to be a 3-point scale for at least some subjects. The extension of this argument is that a 7-point scale is really preferable because in practice it will yield a 5-point scale.

Another type of rating scale is the graphic scale, where the response options follow a straight line or some variation. For example:

How do you feel about capital punishment? Place a check mark on the line:

1. should be abolished
2. should be used only for serious & repeat offenses
3. should be used for all serious offenses
4. is a deterrent & should be retained
5. should be used for all career criminals

Another example:

Where would you locate President Clinton on the following scale?

An excellent leader.
Better than most prior presidents.
Average in leadership.
Less capable than most other presidents.
Totally lacking in leadership capabilities.

Note that a scale could combine both numerical and graphic properties; essentially what distinguishes a graphic scale is the presentation of some device, such as a line, where the respondent can place their answer. Note also, that from a psychometric point of view, it is easier to "force" the respondent to place their mark in a particular segment, rather than to allow free rein. In the capital punishment example above, we could place little vertical lines to distinguish and separate the five response options. Or we could allow the respondent to check anywhere on the scale, even between responses, and generate a score by actually measuring the distance where they placed their mark from the extreme left-hand beginning of the line. Guilford (1954) discusses these scales at length, as well as other less common types.

Self-anchoring scales. Kilpatrick and Cantril (1960) presented an approach that they called self-anchoring scaling, where the respondent is asked to describe the top and bottom anchoring points in terms of his or her own perceptions, values, attitudes, etc. This scaling method grew out of transactional theory that assumes that we live and operate in the world, through the self, both as personally perceived. That is, there is a
unique reality for each of us – my perception of the world is not the same as your perception; what is perceived is inseparable from the perceiver.

Self-anchoring scales require both open-ended interviewing, content analysis, and nonverbal scaling. The first step is to ask the respondent to describe the "ideal" way of life. Second, he or she is asked to describe the "worst" way of life. Third, he or she is given a pictorial, nonverbal scale, such as an 11-point ladder with rungs numbered from 10 at the top down to 0 at the bottom.

The respondent is told that a 10 represents the ideal way of life as he or she described it, and 0 represents the worst way of life. So the two anchors have been defined by the respondent. Now the respondent is asked, "where on the ladder are you now?" Other questions may be asked, such as, "where on the ladder were you five years ago," "where will you be in two years," and so on.

The basic point of the ladder is that it provides a self-defined continuum that is anchored at either end in terms of personal perception. Other than that, the entire procedure is quite flexible. Fewer or more than 11 steps may be used; the numbers themselves may be omitted; a rather wide variety of concepts can be scaled; and instructions may be given in written form rather than as an interview, allowing the simultaneous assessment of a group of individuals.

Designing attitude scales. Oppenheim (1992), in discussing the design of "surveys," suggests a series of 14 steps. These are quite applicable to the design of attitude scales and are quite similar to the more generic steps suggested in Chapter 2. They are well worth repeating here (if you wish additional information on surveys, see Kerlinger, 1964; Kidder, Judd, & Smith, 1986; Rossi, Wright, & Anderson, 1983; Schuman & Kalton, 1985; Singer & Presser, 1989):

1. First decide the aims of the study. The aims should not be simply generic aims (I wish to study the attitudes of students toward physician-assisted suicide) but should be specific, and take the form of hypotheses to be tested (students who are highly authoritarian will endorse physician-assisted suicide to a greater degree than less authoritarian students).
2. Review the relevant literature and carry out discussions with appropriate informants, individuals who by virtue of their expertise and/or community position are knowledgeable about the intended topic.
3. Develop a preliminary conceptualization of the study and revise it based on exploratory and/or in-depth interviews.
4. Spell out the design of the study and assess its feasibility in terms of time, cost, staffing needed, and so on.
5. Spell out the operational definitions – that is, if our hypothesis is that "political attitudes are related to socioeconomic background," how will each of these variables be defined and measured?
6. Design or adapt the necessary research instruments.
7. Carry out pilot work to try out the instruments.
8. Develop a research design: How will respondents be selected? Is a control group needed? How will participation be ensured?
9. Select the sample(s).
10. Carry out the field work: interview subjects and/or administer questionnaires.
11. Process the data: code and/or score the responses, enter the data into the computer.
12. Carry out the appropriate statistical analyses.
13. Assemble the results.
14. Write the research report.
Writing items for attitude scales. Much of our earlier discussion on writing test items also applies here. Writing statements for any psychometric instrument is both an art and a science. A number of writers (e.g., A. L. Edwards, 1957a; A. L. Edwards & Kilpatrick, 1948; Payne, 1951; Thurstone & Chave, 1929; Wang, 1932) have made many valuable suggestions, such as: make statements brief, unambiguous, simple, and direct; each statement should focus on only one idea; avoid double negatives; avoid "apple pie and motherhood" type of statements that everyone agrees with; don't use universals such as "always" or "never"; don't use emotionally laden words such as "adultery," "Communist," "agitator"; where possible, use positive rather than negative wording. For attitude scales, one difference is that factual statements, pertinent in achievement testing, do not make good items because individuals with different attitudes might well respond identically.

Ambiguous statements should not be used. For example, "It is important that we give Venusians the recognition they deserve" is a poor statement because it might be interpreted positively (Venusians should get more recognition) or negatively (Venusians deserve little recognition and that's what they should get). A. L. Edwards (1957a) suggested that a good first step in the preliminary evaluation of statements is to have a group of individuals answer the items first as if they had a favorable attitude and then as if they had an unfavorable attitude. Items that show a distinct shift in response are most likely useful items.

Closed vs. open response options. Most attitude scales presented in the literature use closed response options; this is the case in both the Thurstone and Likert methods where the respondent endorses (or not) a specific statement. We may also wish to use open response options, where respondents are asked to indicate in their own words what their attitude is – for example, "How valuable were the homework assignments in this class?" "Comment on the textbook used," and so on. Closed response options are advantageous from a statistical point of view. Open response options are more difficult to handle statistically, but can provide more information and allow respondents to express their feelings more directly. Both types of items can of course be used.

Measuring attitudes in specific situations. There are a number of situations where the assessment of attitudes might be helpful, but available scales may not quite fit the demands of the situation. For example, a city council may wish to determine how citizens feel toward the potential construction of a new park, or the regents of a university might wish to assess whether a new academic degree should be offered. The same steps we discussed in Chapter 2 might well be used here (or the steps offered by Oppenheim [1992] above). Perhaps it might not be necessary to have a "theory" about the proposed issue, but it certainly would be important to identify the objectives that are to be assessed and to produce items that follow the canons of good writing.

VALUES

Values also play a major role in life, especially because, as philosophers tell us, human beings are metaphysical animals searching for the purpose of their existence. Such purposes are guidelines for life or values (Grosze-Nipper & Rebel, 1987). Like the assessment of attitudes, the assessment of values is also a very complex undertaking, in part because values, like most other psychological variables, are constructs, i.e., abstract conceptions. Different social scientists have different conceptions and so perceive values differently, and there does not seem to be a uniformly accepted way of defining and conceptualizing values. As with attitudes, values cannot be measured directly; we can only infer a person's values by what they say and/or what they do. But people are complex and do not necessarily behave in logically consistent ways. Not every psychologist agrees that values are important; Mowrer (1967), for example, believed that the term "values" was essentially useless.

Formation and changes in values. Because of the central role that values occupy, there is a vast body of literature, both experimental and theoretical, on this topic. One intriguing question concerns how values are formed and how values change. Hoge and Bender (1974) suggested that there are three theoretical models that address this issue. The first model assumes that values are formed and changed by a vast array of events and experiences. We are all in the same "boat" and
whatever affects that boat affects all of us. Thus, as our society becomes more violence-prone and materialistic, we become more violence-prone and materialistic. A second model assumes that certain developmental periods are crucial for the establishment of values. One such period is adolescence, and so high school and the beginning college years are "formative" years. This means that when there are relatively rapid social changes, different cohorts of individuals will have different values. The third model also assumes that values change developmentally, but the changes are primarily a function of age – for example, as people become older, they become more conservative.

The Study of Values (SoV)

The SoV (Allport, Vernon, & Lindzey, 1960; Vernon & Allport, 1931) was for many years the leading measure of values, used widely by social psychologists, in studies of personality, and even as a counseling and guidance tool. The SoV seems to be no longer popular, but it is still worthy of a close look. The SoV, originally published in 1931 and revised in 1951, was based on a theory (by Spranger, 1928) that assumed there were six basic values or personality types: theoretical, economic, aesthetic, social, political, and religious. As the authors indicated (Allport, Vernon, & Lindzey, 1960), Spranger held a rather positive view of human nature and did not consider the possibility of a "valueless" person, or someone who followed expediency (doing what is best for one's self) or hedonism (pleasure) as a way of life. Although the SoV was in some ways designed to operationalize Spranger's theory, the studies that were subsequently generated were only minimally related to Spranger's views; thus, while the SoV had quite an impact on psychological research, Spranger's theory did not.

The SoV was composed of two parts consisting of forced-choice items in which statements representing different values were presented, with the respondent having to choose one. Each of the 6 values was assessed by a total of 20 items, so the entire test was composed of 120 items. The SoV was designed primarily for college students or well-educated adults, and a somewhat unique aspect was that it could be hand scored by the subject.

Reliability. For a sample of 100 subjects, the corrected split-half reliabilities for the six scales range from .84 to .95, with a mean of .90. Test-retest reliabilities are also reported for two small samples, with a 1-month and a 2-month interval. These values are also quite acceptable, ranging from .77 to .93 (Allport, Vernon, & Lindzey, 1960). Hilton and Korn (1964) administered the SoV seven times to 30 college students over a 7-month period (in case you're wondering, the students were participating in a study of career decision making, and were paid for their participation). Reliability coefficients ranged from a low of .74 for the political value scale to a high of .91 for the aesthetic value scale. Subsequent studies have reported similar values.

An ipsative scale. The SoV is also an ipsative measure: if you score high on one scale you must score lower on some or all of the others. As the authors state in the test manual, it is not quite legitimate therefore to ask whether the scales intercorrelate. Nevertheless, they present the intercorrelations based on a sample of 100 males and a sample of 100 females. As expected, most of the correlations are negative, ranging in magnitude and sign from a −.48 (for religious vs. theoretical, in the female sample) to a +.27 for political vs. economic (in the male sample), and religious vs. social (in the female sample).
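The ipsative property is easy to see in a sketch: if each forced-choice item awards its point to exactly one of the six values, every respondent's six scores must sum to the same constant, so a high score on one scale forces lower scores elsewhere. The simulation below illustrates only this basic logic; the actual SoV item weighting is more involved.

```python
import random

# Each of the 120 forced-choice items awards its point to exactly one
# of the six values, so the six scores always sum to 120. This is why
# ipsative scales tend to produce negative scale intercorrelations.
VALUES = ["theoretical", "economic", "aesthetic",
          "social", "political", "religious"]

def random_protocol(n_items=120):
    scores = dict.fromkeys(VALUES, 0)
    for _ in range(n_items):
        scores[random.choice(VALUES)] += 1  # the value chosen on this item
    return scores

protocol = random_protocol()
print(protocol, "total =", sum(protocol.values()))  # total is always 120
```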
Validity. There are literally hundreds of studies in the literature that used the SoV, and most support its validity. One area in which the SoV has been used is to assess the changes in values that occur during the college years; in fact, K. A. Feldman and Newcomb (1969), after reviewing the available literature, believed that the SoV was the best single source of information about such changes. The study by Huntley (1965), although not necessarily representative, is illustrative and interesting. Huntley (1965) administered the SoV to male undergraduate college students at entrance to college and again just prior to graduation. Over a 6-year period some 1,800 students took the test, with 1,027 having both "entering" and "graduating" profiles. The students were grouped into nine major fields of study, such as science, engineering, and pre-med, according to their graduation status. Huntley (1965) then asked, and answered, four basic
questions: (1) Do values (i.e., SoV scores) change significantly during the 4 years of college? Of the 54 possible changes (9 groups of students × 6 values), 27 showed statistically significant changes, with specific changes associated with specific majors. For example, both humanities and pre-med majors increased in their aesthetic value and decreased in their economic value, while industrial administration majors increased in both their aesthetic and economic values. (2) Do students who enter different majors show different values at entrance into college? Indeed they do. Engineering students, for example, have high economic and political values, while physics majors have low economic and political values. (3) What differences are found among the nine groups at graduation? Basically the same pattern of differences that exist at entrance. In fact, if the nine groups are ranked on each of the values, and the ranks at entrance are compared with those at graduation, there is a great deal of stability. In addition, what appears to happen is that value differences among groups are accentuated over the course of the four collegiate years. (4) Are there general trends? Considering these students as one cohort, theoretical, social, and political values show no appreciable change (keep in mind that these values do change for specific majors). Aesthetic values increase, and economic and religious values decrease, regardless of major.

Norms. The test manual presents norms based on 8,369 college students. The norms are subdivided by gender as well as by collegiate institution. In addition, norms are presented for a wide range of occupational groups, with the results supporting the construct validity of the SoV. For example, clergymen and theological students score highest on the religious value. Engineering students score highest on the theoretical value, while business administration students score highest on the economic and political scales. Subsequent norms included a national sample of high-school students tested in 1968, composed of more than 5,000 males and 7,000 females. Again, given the ipsative nature of this scale, we may question the appropriateness of norms.

Criticisms. Over the years, a variety of criticisms have been leveled at the SoV. For example, Gage (1959) felt that the SoV confounded interests and values. Others repeatedly pointed out that the values assessed were based on "ideal" types and did not necessarily match reality; furthermore, these values appeared to be closely tied to "middle class" values.

The Rokeach Value Survey (RVS)

Introduction. One of the most widely used surveys of values is the Rokeach Value Survey (RVS). Rokeach (1973) defined values as beliefs concerning either desirable modes of conduct or desirable end-states of existence. The first type of values is what Rokeach labeled instrumental values, in that they are concerned with modes of conduct; the second type of values are terminal values, in that they are concerned with end states. Furthermore, Rokeach (1973) divided instrumental values into two types: moral values that have an interpersonal focus, and competence or self-actualization values that have a personal focus. Terminal values are also of two types: self-centered or personal, and society-centered or social.

Rokeach (1973) distinguished values from attitudes in that a value refers to a single belief, while an attitude concerns an organization of several beliefs centered on a specific target. Furthermore, values transcend the specific target, represent a standard, are much smaller in number than attitudes, and occupy a more central position in a person's psychological functioning.

Description. The RVS is a rather simple affair that consists of two lists of 18 values each, which the respondent places in rank order, in order of importance as guiding principles of their life. Table 6.2 illustrates the RVS. Note that each value is accompanied by a short, defining phrase. Originally the RVS consisted simply of printed lists; subsequently, each value is printed on a removable gummed label, and the labels are placed in rank order. The two types of values, instrumental and terminal, are ranked and analyzed separately, but the subtypes (such as personal and social) are not considered. The RVS is then a self-report instrument, group administered, with no time limit, and designed for adolescents and adults. Rokeach (1973) suggests that the RVS is really a projective test, like the Rorschach Inkblot technique, in that the respondent has no
Table 6–2. RVS values


Terminal values Instrumental values
A comfortable life (a prosperous life) Ambitious (hard-working, aspiring)
An exciting life (a stimulating, active life) Broadminded (open-minded)
A sense of accomplishment (lasting contribution) Capable (competent, effective)
A world at peace (free of war and conflict) Cheerful* (lighthearted, joyful)
A world of beauty (beauty of nature and the arts) Clean (neat, tidy)
Equality (brotherhood, equal opportunity for all) Courageous (standing up for your beliefs)
Family security (taking care of loved ones) Forgiving (willing to pardon others)
Freedom (independence, free choice) Helpful (working for the welfare of others)
Happiness* (contentedness) Honest (sincere, truthful)
Inner harmony (freedom from inner conflict) Imaginative (daring, creative)
Mature love (sexual and spiritual intimacy) Independent (self-reliant, self-sufficient)
National security (protection from attack) Intellectual (intelligent, reflective)
Pleasure (an enjoyable, leisurely life) Logical (consistent, rational)
Salvation (saved, eternal life) Loving (affectionate, tender)
Self-respect (self-esteem) Obedient (dutiful, respectful)
Social recognition (respect, admiration) Polite (courteous, well mannered)
True friendship (close companionship) Responsible (dependable, reliable)
Wisdom (a mature understanding of life) Self-controlled (restrained, self-disciplined)

Note: These values were later replaced by health and loyal respectively.
Adapted with the permission of The Free Press, a Division of Simon & Schuster, from The Nature of Human Values by Milton Rokeach. Copyright © 1973 by The Free Press.

guidelines for responding other than his or her own internalized system of values.

How did Rokeach arrive at these particular 36 values? Basically through a clinical process that began with amassing a large number of value labels from various sources (the instrumental values actually began as personality traits), eliminating those that were synonymous, and in some cases those that intercorrelated highly. Thus there is the basic question of content validity, and Rokeach (1973) himself admits that his procedure is "intuitive" and that his results differ from those that might have been obtained by other researchers.

Scoring the RVS. Basically, there is no scoring procedure with the RVS. Once the respondent has provided the two sets of 18 ranks, the ranks cannot of course be added together to get a sum, because every respondent would obtain exactly the same score.

For a group of individuals, we can compute for each value the mean or median of the ranks assigned to that value. We can then convert these average values into ranks. For example, J. Andrews (1973) administered the RVS to 61 college students, together with a questionnaire to assess the degree of "ego identity" achieved by each student. Students classified as "high identity achievement" ranked the RVS instrumental values as follows:

value           mean ranking
honest          5.50
responsible     5.68
loving          5.96
broadminded     6.56
independent     7.48
capable         7.88
etc.

We can change the mean rank values back to ranks by calling honest = 1, responsible = 2, loving = 3, and so on.
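To make this conversion concrete, here is a minimal Python sketch (an illustration of the logic only, not part of the RVS materials), using the mean rankings quoted above from the Andrews (1973) example:

# Mean rank each group member assigned to each value (lower = more important);
# the figures are those reported above for the "high identity achievement" students.
mean_ranks = {
    "honest": 5.50, "responsible": 5.68, "loving": 5.96,
    "broadminded": 6.56, "independent": 7.48, "capable": 7.88,
}

# Convert the mean ranks back to composite ranks: honest = 1, responsible = 2, ...
ordered = sorted(mean_ranks, key=mean_ranks.get)
composite = {value: rank for rank, value in enumerate(ordered, start=1)}
print(composite)  # {'honest': 1, 'responsible': 2, 'loving': 3, ...}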
Another scoring approach is to sum together subsets of values that, on the basis of either a statistical criterion such as factor analysis or a clinical judgment such as content analysis, seem to go together. For example, Silverman, Bishop, and Jaffe (1976) studied the RVS responses of some 954 psychology graduate students. To determine whether there were differences between students who studied different fields of psychology (e.g., clinical, experimental, developmental), the investigators computed the average of the median rankings assigned to "mature love," "true friendship," "cheerful,"
"helpful," and "loving" – this cluster of values was labeled "interpersonal affective values." A similar index, called "cognitive competency," was calculated by averaging the median rankings for "intellectual" and "logical."

Reliability. There are at least two ways of assessing the temporal stability (i.e., test-retest reliability) of the RVS. One way is to administer the RVS to a group of individuals and retest them later. For each person, we can correlate the two sets of ranks and then compute the median of such rank order correlation coefficients for our sample of subjects. Rokeach (1973) reports such medians as ranging from .76 to .80 for terminal values and .65 to .72 for instrumental values, with samples of college students retested after 3 weeks to 4 months.

Another way is also to administer the RVS twice, but to focus on each value separately. We may, for example, start out with "a comfortable life." For each subject in our sample, we have the two ranks assigned to this value. We can then compute a correlation coefficient across subjects for that specific value. When this is done, separately for each of the 36 values, we find that the reliabilities are quite low; for the terminal values the average reliability is about .65 (Rokeach, 1973) and for the instrumental values it is about .56 (Feather, 1975). This is of course not surprising, because each "scale" is made up of only one item. One important implication of such low reliability is that the RVS should not be used for individual counseling and assessment.

One problem, then, is that the reliability of the RVS is marginal at best. Rokeach (1973) presents the results of various studies, primarily with college students, and with various test-retest intervals ranging from 3 weeks to 16 months; of the 29 coefficients given, 14 are below .70, and all range from .53 to .87, with a median of .70. Inglehart (1985), on the other hand, looked at the results of a national sample, one assessed in 1968 and again in 1981. Because there were different subjects, it is not possible to compute correlation coefficients, but Inglehart (1985) reported that the stability of rankings over the 13-year period was "phenomenal." The six highest- and six lowest-ranked values in 1968 were also the six highest- and six lowest-ranked values in 1981.
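The two procedures are easy to confuse. The following Python sketch shows how they differ; the data structures are fictitious assumptions (ranks1[p] and ranks2[p] are taken to hold person p's 18 ranks at test and retest):

from statistics import median

def spearman(x, y):
    # Spearman rho for two untied rankings of the same n objects
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def pearson(xs, ys):
    # Ordinary product-moment correlation, used for the per-value approach
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

def per_person_stability(ranks1, ranks2):
    # First approach: one rho per person, then the median across persons
    return median(spearman(ranks1[p], ranks2[p]) for p in ranks1)

def per_value_stability(ranks1, ranks2, value_index):
    # Second approach: one correlation per value, computed across persons;
    # this is the reliability of a single one-item "scale"
    xs = [ranks1[p][value_index] for p in ranks1]
    ys = [ranks2[p][value_index] for p in ranks2]
    return pearson(xs, ys)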
It is interesting to note that all of the 36 values are socially desirable, and that respondents often indicate that the ranking task is a difficult one and that they have "little confidence" that they have done so in a reliable manner.

Validity. Rokeach's (1973) book is replete with various analyses and comparisons of RVS rankings, including cross-cultural comparisons and analyses of such variables as race, socioeconomic status, educational level, and occupation. The RVS has also been used in hundreds of studies across a wide spectrum of topics, with most studies showing encouraging results that support the construct validity of this instrument. These studies range from comparisons of women who prefer "Ivory" as a washing machine detergent to studies of hippies (Rokeach, 1973). One area where the study of values has found substantial application is that of psychotherapy, where the values of patients and of therapists, and their concomitant changes, have been studied (e.g., Beutler, Arizmendi, Crago, et al., 1983; Beutler, Crago, & Arizmendi, 1986; Jensen & Bergin, 1988; Kelly, 1990).

Cross-cultural aspects. Rokeach (1973) believed that the RVS could be used cross-culturally because the values listed are universal and problems of translation can be surmounted. On the other hand, it can be argued that these values are relevant to Western cultures only; for example, "filial piety," a central value for the Chinese, is not included in the RVS. It can also be argued that although the same word can be found in two languages, it does not necessarily have the same layers of meaning in the two cultures. Nevertheless, a number of investigators have applied the RVS cross-culturally, both in English-speaking countries such as Australia and in non-Western cultures such as China (e.g., Feather, 1986; Lau, 1988; Ng et al., 1982).

An example of a cross-cultural application is found in the study by Domino and Acosta (1987), who administered the RVS to a sample of first-generation Mexican Americans. These individuals were identified as being either "highly acculturated," that is more American, or "less acculturated," that is more Mexican. Their rankings of the RVS were then analyzed in various ways, including comparisons with the national norms
Table 6–3. Factor structure of the RVS, based on a sample of 1,409 respondents (Rokeach, 1973)

Factor   Example of item with positive loading   Example of item with negative loading   Percentage of variance
1. Immediate vs. delayed gratification A comfortable life Wisdom 8.2
2. Competence vs. religious morality Logical Forgiving 7.8
3. Self-constriction vs. self-expansion Obedient Broadminded 5.5
4. Social vs. personal orientation A world at peace True friendship 5.4
5. Societal vs. family security A world of beauty Family security 5.0
6. Respect vs. love Social recognition Mature Love 4.9
7. Inner vs. other directed Polite Courageous 4.0

provided by Rokeach and with local norms based on Anglos. These researchers found a greater correspondence of values between the high-acculturation subjects and the comparison groups than between the low-acculturation subjects and the comparison groups – those who were more "American" in their language and general cultural identification were also more American in their values.

Factor analysis. Factor analytic studies do seem to support the terminal-instrumental differentiation, although not everyone agrees (e.g., Crosby, Bitner, & Gill, 1990; Feather & Peay, 1975; Heath & Fogel, 1978; Vinson et al., 1977). Factor analyses suggest that the 36 values are not independent of each other and that certain values do cluster together. Rokeach (1973) suggests that there are seven basic factors that cut across the terminal-instrumental distinction. These factors are indicated in Table 6.3. One question that can be asked of the results of a factor analysis is how "important" each factor is. Different respondents give different answers (ranks) to different values. This variation of response can be called "total variance." When we identify a factor, we can ask how much of the total variance that factor accounts for. For the RVS data reported in Table 6.3, factor 1 accounts for only 8.2% of the total variation, and in fact all seven factors together account for only 40.8% of the total variation, leaving 59.2% of the variation unaccounted for. This suggests that the factors are probably not very powerful, either in predicting behavior or in helping us to conceptualize values. Heath and Fogel (1978) had subjects rate rather than rank the importance of each of the 36 values; their results suggested eight factors rather than seven.

Norms. Rokeach (1973) presents the rankings for a group of 665 males and a group of 744 females, and these are presented in Table 6.4. Note that of the 36 values, 20 show significant gender differences. Even though the ranks may be identical, there may be a significant difference in the actual rank value assigned. The differences seem to be in line with the different ways that men and women are socialized in Western cultures, with males endorsing more achievement-oriented and intellectually oriented values, more materialistic and pleasure seeking, while women rank higher religious values, love, personal happiness, and lack of both inner and outer conflict.

Rank order correlation coefficient. Despite the caveat that the RVS should not be used for individual counseling, we use a fictitious example to illustrate the rank order correlation coefficient, designed to compare two sets of ranks. Let's say that you and your fiancé are contemplating marriage, and you wonder whether your values are compatible. You both independently rank order the RVS items. The results for the instrumental values are shown in Table 6.5. The question here is how similar the two sets of values are. We can easily calculate the rank order correlation coefficient (ρ) using the formula:

ρ = 1 − (6 Σ D²) / (N (N² − 1))

where N stands for the number of items being ranked; in this case N = 18. All we need to do is calculate for each set of ranks the difference
Table 6–4. Values medians and composite rank orders for American men and women
(Rokeach, 1973)
Terminal values   Male (n = 665)   Female (n = 744)   Lower rank shown by
A comfortable life 7.8 (4) 10.0 (13) Males
An exciting life 14.6 (18) 15.8 (18) Males
A sense of accomplishment 8.3 (7) 9.4 (10) Males
A world at peace 3.8 (1) 3.0 (1) Females
A world of beauty 13.6 (15) 13.5 (15) –
Equality 8.9 (9) 8.3 (8) –
Family security 3.8 (2) 3.8 (2) –
Freedom 4.9 (3) 6.1 (3) Males
Happiness 7.9 (5) 7.4 (5) Females
Inner harmony 11.1 (13) 9.8 (12) Females
Mature love 12.6 (14) 12.3 (14) –
National security 9.2 (10) 9.8 (11) –
Pleasure 14.1 (17) 15.0 (16) Males
Salvation 9.9 (12) 7.3 (4) Females
Self-respect 8.2 (6) 7.4 (6) Females
Social recognition 13.8 (16) 15.0 (17) Males
True friendship 9.6 (11) 9.1 (9) –
Wisdom 8.5 (8) 7.7 (7) Females
Instrumental values
Ambitious 5.6 (2) 7.4 (4) Males
Broadminded 7.2 (4) 7.7 (5) –
Capable 8.9 (8) 10.1 (12) Males
Cheerful 10.4 (12) 9.4 (10) Females
Clean 9.4 (9) 8.1 (8) Females
Courageous 7.5 (5) 8.1 (6) –
Forgiving 8.2 (6) 6.4 (2) Females
Helpful 8.3 (7) 8.1 (7) –
Honest 3.4 (1) 3.2 (1) –
Imaginative 14.3 (18) 16.1 (18) Males
Independent 10.2 (11) 10.7 (14) –
Intellectual 12.8 (15) 13.2 (16) –
Logical 13.5 (16) 14.7 (17) –
Loving 10.9 (14) 8.6 (9) Females
Obedient 13.5 (17) 13.1 (15) –
Polite 10.9 (13) 10.7 (13) –
Responsible 6.6 (3) 6.8 (3) –
Self-controlled 9.7 (10) 9.5 (11) –
Note: The figures shown are median rankings and in parentheses composite rank orders.
The gender differences are based on median rankings.
Adapted with the permission of The Free Press, a Division of Simon & Schuster from The Nature of Human Values by
Milton Rokeach. Copyright © 1973 by The Free Press.

between ranks, square each difference, and find the sum. This is done in Table 6.5, in the columns labeled D (difference) and D². The sum is 746, and substituting in the formula gives us:

ρ = 1 − 6(746)/5814 = 1 − 746/969 = 1 − .77 = +.23

These results would suggest that there is a very low degree of agreement between you and your fiancé as to what values are important in life, and indeed a perusal of the rankings suggests some highly significant discrepancies (e.g., self-controlled and courageous), some less significant discrepancies (e.g., cheerful and clean), and some near unanimity (ambitious and broadminded). If these results were reliable, one might predict some conflict ahead, unless of course you believe in the "opposites attract" school of thought rather than the "birds of a feather flock together" approach.
Table 6–5. Computational example of the rank order correlation coefficient using RVS data

Instrumental value   Your rank   Your fiancé's rank   D    D²
Ambitious                2            1                1     1
Broadminded              8            9                1     1
Capable                  4            2                2     4
Cheerful                12            7                5    25
Clean                   15           10                5    25
Courageous               5           16               11   121
Forgiving                6           12                6    36
Helpful                  7           17               10   100
Honest                   1           11               10   100
Imaginative             18           14                4    16
Independent             11            3                8    64
Intellectual            10           15                5    25
Logical                  9            4                5    25
Loving                  13            8                5    25
Obedient                14           18                4    16
Polite                  16           13                3     9
Responsible              3            6                3     9
Self-controlled         17            5               12   144
                                                  ΣD² = 746

Criticisms. The RVS has been criticized for a number of reasons (Braithwaite & Law, 1985; Feather, 1975). It is of course an ipsative measure and yields only ordinal data; strictly speaking, its data should not be used with analysis of variance or other statistical procedures that require a normal distribution, although such procedures are indeed "robust" and seem to apply even when the assumptions are violated. Others have questioned whether the RVS measures what one prefers or what one ought to prefer (Bolt, 1978), as well as the distinction between terminal and instrumental values (Heath & Fogel, 1978).

One major criticism is that the rank ordering procedure does not allow for the assessment of intensity, which is basically the same criticism as that this is not an interval scale. Thus two individuals can select the same value as their first choice, and only one may feel quite sanguine about it. Similarly, you may give a value a rank of 2 because it really differs from your number 1 choice, but the difference may be minimal for another person with the identical rankings. In fact, several researchers have modified the RVS into an interval measure (e.g., Moore, 1975; Penner, Homant, & Rokeach, 1968; Rankin & Grobe, 1980). Interestingly enough, some of the results suggest that rank-order scaling is a better technique than other approaches (e.g., Miethe, 1985).

INTERESTS

We now turn to the third area of measurement for this chapter, and that is interests, and more specifically, career interests. How can career interests be assessed? The most obvious and direct method is to ask individuals what they are interested in. These are called expressed interests, and perhaps not surprisingly, this is a reasonably valid method. On the other hand, people are often not sure what their interests are, or are unable to specify them objectively, or may have little awareness of how their particular interests and the demands of the world of work might dovetail. A second way is the assessment of such likes and dislikes through inventories. This method is perhaps the most popular and has a number of advantages, including the fact that it permits individuals to compare their interests with those of other people, and more specifically with people in various occupations. A third way is to assume that someone interested in a particular occupation will have a fair amount of knowledge about that occupation, even before entering it. Thus we could put together a test of knowledge about being a lawyer and assume that those who score high may be potential lawyers. That of course is a major assumption, not necessarily reflective of the real world. Finally, we can observe a person's behavior. If Johnny, a high school student, spends all of his spare time repairing automobiles, we might speculate that he is headed for a career as an auto mechanic – but of course, our speculations may be quite incorrect.

The field of career interest measurement has been dominated by the work of two individuals. In 1927, E. K. Strong, Jr. published the Strong Vocational Interest Blank for Men, an empirically
based inventory that compared a person's likes and dislikes with those of individuals in different occupations. The SVIB and its revisions became extremely popular and were used frequently in both college settings and private practice (Zytowski & Warman, 1982). In 1934, G. F. Kuder developed the Kuder Preference Record, which initially used content scales (e.g., agriculture) rather than specific occupational scales. This test also proved quite popular and underwent a number of revisions.

A third key event in the history of career interest assessment occurred in 1959, when John Holland published a theory regarding human behavior that found wide applicability to career interest assessment. Holland argued that the choice of an occupation is basically a reflection of one's personality, and so career-interest inventories are basically personality inventories.

Much of the literature and efforts in career assessment depend on a general assumption that people with similar interests tend to enter the same occupation, and that to the degree that one's interests are congruent with those of people in that occupation, the result will be greater job satisfaction. There certainly seems to be substantial support for the first part of that assumption, but relatively little for the second part.

The Strong Interest Inventory (SII)

Introduction. The Strong Vocational Interest Blank for Men (SVIB) is the granddaddy of all career-interest inventories, developed by E. K. Strong and originally published in 1927. A separate form for women was developed in 1933. The male and female forms were each revised twice, separately. In 1974, the two gender forms were merged into one. The SVIB became the Strong-Campbell Interest Inventory (SCII) and underwent extensive revisions (D. P. Campbell, 1974; D. P. Campbell & J. C. Hansen, 1981; J. C. Hansen & D. P. Campbell, 1985), including the development of occupational scales that were traditionally linked with the opposite sex – for example, a nursing scale for males, and carpenter and electrician scales for women. Recently, the name was changed to the Strong Interest Inventory (SII), or Strong for short, and a 1994 revision was published. To minimize confusion and reduce the alphabet soup, the word Strong is used here to refer to any of these inventories (except in the rare instances where this would violate the intended meaning).

Description. Basically, the Strong compares a person's career interests with those of people who are satisfactorily employed in a wide variety of occupations. It is thus a measure of interests, not of ability or competence. The Strong contains 325 items grouped into seven sections. The bulk of the items (the first five sections) require the respondent to indicate like, dislike, or indifferent to 131 occupations (Would you like to be a dentist? a psychologist?), 36 school subjects (algebra, literature), 51 career-related activities (carpentry; gardening; fund raising), 39 leisure activities (camping trips; cooking), and 24 types of people (Would you like to work with children? the elderly? artists?). Section 6 requires the respondent to select from pairs of activities the ones they prefer (Would you prefer working with "things" or with people?), and section 7 has some self-descriptive statements (Are you a patient person?). Strong originally used these various types of items in an empirical effort to see which type worked best. Subsequent research suggests that item content is more important than item format, and so the varied items have also been retained because they relieve the monotony of responding to a long list of similar questions (D. P. Campbell, 1974).

The primary aim of the Strong is the counseling of high school and college students, as well as adults who are college graduates, about their career choices. It particularly focuses on those careers that attract college graduates, rather than blue-collar occupations or skilled trades such as electrician and plumber. Thus the Strong is geared primarily for ages 17 and older. Career interests seem to stabilize for most people between the ages of 20 and 25, so the Strong is most accurate for this age range; it does not seem to be appropriate or useful for anyone younger than 16.

It is not the intent of the Strong to tell people what career they should enter or where they can be successful in the world of work. In fact, the Strong has little to do with competence and capabilities; a person may have a great deal of similarity of interest with those shown by physicians, but have neither the cognitive abilities nor
the educational credentials required to enter and do well in medical school.

There are at least two manuals available for the professional user: the Manual, which contains the technical data (J. C. Hansen & D. P. Campbell, 1985), and the User's Guide (J. C. Hansen, 1984), which is more "user friendly" and more of a typical manual.

Item selection. Where did the items in the Strong come from? Originally, they were generated by Strong and others, and were basically the result of "clinical insight." Subsequently, the items contained in the current Strong came from earlier editions and were selected on the basis of their psychometric properties (i.e., reliability and validity), as well as on their "public relations" aspects – that is, they would not offend, irritate, or embarrass a respondent. As in other forms of testing, items that yield variability of response, or response range, are the most useful. D. P. Campbell and J. C. Hansen (1981), for example, indicate that items such as "funeral director" and "geography" were eliminated because almost everyone indicates "dislike" to the former and "like" to the latter. An item such as "college professor," on the other hand, yields "like" responses of about 5% in samples of farmers to 99% in samples of behavioral scientists.

Other criteria were also used in judging whether an item would be retained or eliminated. Both predictive and concurrent validity are important, and items showing these aspects were retained. For example, the Strong should have content validity, and so the items should cover a wide range of occupational content. Because sex-role bias was of particular concern, items were modified (policeman became police officer) or otherwise changed. Items that showed a significant gender difference in response were not necessarily eliminated, as the task is to understand such differences rather than to ignore them. Because the United States is such a conglomeration of minorities, and because the Strong might be useful in other cultures, items were retained if they were not "culture bound," although the actual operational definition of this criterion might be a bit difficult to give. Other criteria, such as reading level, lack of ambiguity, and current terminology, were also used.

Scale development. Let's assume you want to develop an occupational scale for "golf instructors." How might you go about this? J. C. Hansen (1986) indicates that there are five steps in the construction of an occupational scale for the Strong:

1. You need to collect an occupational sample, in this case, golf instructors. Perhaps you might identify potential respondents through some major sports organization, labor union, or other societies that might provide such a roster. Your potential respondents must, however, satisfy several criteria (in addition to filling out the Strong): they must be satisfied with their occupation, be between the ages of 25 and 60, have at least 3 years of experience in that occupation, and perform work that is "typical" of that occupation – for example, a golf instructor who spends his or her time primarily designing golf courses would be eliminated.
2. You also need a reference group – although ordinarily you would use the available data based on 300 "men in general" and 300 "women in general." This sample has an average age of 38 years and represents a wide variety of occupations, half professional and half nonprofessional.
3. Once you've collected your data, you'll need to compare, for each of the 325 Strong items, the percentages of "like," "indifferent," and "dislike" responses. The aim here is to identify 60 to 70 items that show a response difference of 16% or greater.
4. Now you can assign scoring weights to each of the 60 to 70 items. If the golf instructors endorsed "like" more often than the general sample, that item is scored +1; if the golf instructors endorsed "dislike" more often, then the item is scored −1 (for like). If there are substantial differences between the two samples on the "indifferent" response, then that response is also scored.
5. Now you can obtain the raw scores for each of your golf instructors and compute your normative data, changing the raw scores to T scores (see the code sketch following this list).
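The following Python sketch shows one way steps 3 through 5 might be coded. The data structures and numbers are hypothetical assumptions, and the actual Strong scoring rules are more detailed than this; the sketch is meant only to make the logic of item selection, ±1 weighting, and T-score conversion concrete.

# occ_pct and ref_pct map each item to its response percentages, e.g.,
# occ_pct["computer programmer"] == {"like": 85, "indifferent": 5, "dislike": 10}
def build_scale(occ_pct, ref_pct, threshold=16):
    """Keep responses whose endorsement differs by >= threshold points;
    weight +1 if the occupational sample endorses them more, else -1."""
    scale = {}
    for item in occ_pct:
        weights = {}
        for resp in ("like", "indifferent", "dislike"):
            diff = occ_pct[item][resp] - ref_pct[item][resp]
            if abs(diff) >= threshold:
                weights[resp] = 1 if diff > 0 else -1
        if weights:
            scale[item] = weights
    return scale

def raw_score(scale, answers):
    """Sum the +1/0/-1 weights for one respondent; answers maps item -> response."""
    return sum(scale.get(item, {}).get(resp, 0) for item, resp in answers.items())

def t_score(raw, norm_mean, norm_sd):
    """Express a raw score as a T score (mean 50, SD 10) relative to the norms."""
    return 50 + 10 * (raw - norm_mean) / norm_sd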
Development. In more general terms, then, the occupational scales on the Strong were developed by administering the Strong pool of items to men and women in a specific occupation and comparing the responses of this criterion group with
those of men, or women, in general. Although the various criterion groups differed depending on the occupation, they were typically large, with Ns over 200 and more typically near 400. They were composed of individuals between the ages of 25 and 55 who were still active in their occupation, who had been in that occupation for at least 3 years and were thus presumably satisfied, who indicated that they liked their work, and who had met some minimum level of proficiency, such as licensing, to eliminate those who might be incompetent.

The comparison group, the men-in-general or women-in-general sample, is a bit more difficult to define, because its nature and composition have changed over the years. When Strong began his work in the mid-1920s, the in-general sample consisted of several thousand men he had tested. Later he collected a new sample based on U.S. Census Bureau statistics, but the sample contained too many unskilled and semiskilled men. When response comparisons of a criterion group were made to this comparison group, the result was that professional men shared similar interests among themselves as compared with nonprofessional men. The end result would have been a number of overlapping scales that would be highly intercorrelated and therefore of little use for career guidance. For example, a physician scale would have reflected the differences in interests between men in a professional occupation and men in nonprofessional occupations; a dentist scale would have reflected those same differences.

From 1938 to 1966 the in-general sample was a modification of the U.S. Census Bureau sample, but included only those men whose salary would have placed them in the middle class or above. From 1966 onward, a number of approaches were used, including a women-in-general sample composed of 20 women in each of 50 occupations, and men-in-general samples with occupation membership weighted equally, i.e., an equal number of biologists, physicians, life insurance salesmen, etc.

Administration. The Strong is not timed and takes about 20 to 30 minutes to complete. It can be administered individually or in groups, and is basically a self-administered inventory. The separate answer sheet must be returned to the publisher for computer scoring.

Scoring. The current version of the Strong needs to be computer scored, and several such services are available. The Strong yields five sets of scores:

1. Administrative Indices
2. General Occupational Themes
3. Basic Interest Scales
4. Occupational Scales
5. Special Scales

The Administrative Indices are routine clerical checks performed by the computer as the answer sheet is scored; they are designed to assess procedural errors and are for use by the test administrator to determine whether the test results are meaningful. These indices include the number of items that were answered, the number of infrequent responses given, and the percentages of like, dislike, and indifferent responses given for each of the sections. For example, one administrative index is simply the total number of responses given. There are 325 items, and a respondent may omit some items, or may unintentionally skip a section, or may make some marks that are too light to be scored. A score of 310 or less alerts the administrator that the resulting profile may not be valid.

The General Occupational Themes are a set of six scales, each designed to portray a "general" type as described in Holland's theory (discussed next). These scales were developed by selecting 20 items to represent each of the six types. The items were selected on the basis of both face and content validity (they covered the typological descriptions given by Holland) and statistical criteria such as item-scale correlations.

The Basic Interest Scales consist of 23 scales that cover somewhat more specific occupational areas such as agriculture, mechanical activities, medical service, art, athletics, sales, and office practices. These scales were developed by placing together items that correlated .30 or higher with each other. Thus these scales are homogeneous and very consistent in content.

The 211 Occupational Scales in the 1994 revision cover 109 different occupations, from accountants to YMCA directors, each scale developed empirically by comparing the
responses of men and/or women employed in that occupation with the responses of a reference group of men, or of women, in general. For most of the occupations there is a scale normed on a male sample and a separate scale normed on a female sample. Why have separate gender scales? The issue of gender differences is a complex one, fraught with all sorts of social and political repercussions. In fact, however, men and women respond differently to about half of the items contained in the Strong, and therefore separate scales and separate norms are needed (J. C. Hansen & D. P. Campbell, 1985). Most of these samples were quite sizable, with an average close to 250 persons and a mean age close to 40 years; to develop and norm these scales, more than 142,000 individuals were tested. Some of the smaller samples are quite unique and include astronauts, Pulitzer Prize-winning authors, college football coaches, state governors, and even Nobel Prize winners (D. P. Campbell, 1971). Because these scales have been developed empirically, they are factorially complex, most made up of rather heterogeneous items, and often with items that do not have face validity. The Psychologist Scale, for example, includes items that reflect an interest in science, in the arts, and in social service, as well as items having to do with business and military activities, which are weighted negatively. Thus two people with identical scores on this scale may in fact have different patterns of responding. Though empirically these scales work well, it is difficult for a counselor to understand the client unless one undertakes an analysis of such differential responding. However, by looking at the scores on the Basic Interest Scales, mentioned above, one can better determine where the client's interests lie, and thus better understand the results of the specific occupational scales.

Finally, there are the Special Scales. At present, two of these are included in routine scoring:

1. The Academic Comfort Scale, which was developed by contrasting the responses of high-GPA students with low-GPA students; this scale attempts to differentiate between people who enjoy being in an academic setting and those who do not.
2. The Introversion-Extroversion Scale, which was developed by contrasting the responses of introverted with those of extroverted individuals, as defined by their scores on the MMPI scale of the same name. High scorers (introverts) prefer working with things or ideas, while low scorers (extroverts) prefer working with people.

Scores on the Strong are for the most part presented as T scores with a mean of 50 and SD of 10.

Interpretation of the profile. The resulting Strong profile presents a wealth of data, which is both a positive feature and a negative one. The negative aspect comes about because this wealth of information provides data not just on the career interests of the client, but also on varied aspects of their personality, their psychological functioning, and their general psychic adjustment, and thus demands a high degree of psychometric and psychological sophistication from the counselor in interpreting and communicating the results to the client. Not all counselors have such a degree of training and sensitivity, and often the feedback session with the client is less than satisfying (for some excellent suggestions regarding test interpretation and some illustrative case studies, see D. P. Campbell & J. C. Hansen, 1981).

Criterion-keying. When the Strong was first introduced in 1927, it pioneered the use of criterion-keying of items, later incorporated into personality inventories such as the MMPI and the CPI. Thus the Strong was administered to groups of individuals in specific occupations, and their responses were compared with those of "people in general." Test items that showed differential response patterns between a particular occupational group, for example dentists, and people in general then became the dentist scale. Hundreds of such occupational scales were developed, based on the simple fact that individuals in different occupations have different career interests. It is thus possible to administer the Strong to an individual and determine that person's degree of similarity between their career interests and those shown by individuals in specific careers. Thus each of the occupational scales is basically a subset of items that show large differences in response percentages between individuals in that occupation and a general sample. How large is large? In general, items that show at least a 16%
difference are useful items; for example, if 58% of the specific occupational sample respond "like" to a particular item vs. 42% of the general sample, that item is potentially useful (D. P. Campbell & J. C. Hansen, 1981). Note that one such item would not be very useful, but the average occupational scale contains about 60 such items, each contributing to the total scale.

Gender bias. The earlier versions of the Strong not only contained separate scoring for occupations based on the respondent's gender, but the separate gender booklets were printed in blue for males and pink for females! Thus, women's career interests were compared with those of nurses, school teachers, secretaries, and other traditionally "feminine" occupations. Fortunately, current versions of the Strong have done away with such sexism, have in fact pioneered gender equality in various aspects of the test, and provide substantial career information for both genders in one test booklet.

Holland's theory. The earlier versions of the Strong were guided primarily by empirical considerations, and occupational scales were developed because there was a need for such scales. As these scales proliferated, it became apparent that some organizing framework was needed to group subsets of scales together. Strong and others developed a number of such classifying schemas based on the intercorrelations of the occupational scales, on factor analysis, and on the identification of homogeneous clusters of items. In 1974, however, a number of changes were made, including the incorporation of Holland's (1966; 1973; 1985a) theoretical framework as a way of organizing the test results.

Holland believes that individuals find specific careers attractive because of their personalities and background variables; he postulated that all occupations could be conceptualized as representing one of six general occupational themes, labeled realistic, investigative, artistic, social, enterprising, and conventional.

Individuals whose career interests are high in the realistic area are typically aggressive persons who prefer concrete activities to abstract work. They prefer occupations that involve working outdoors and working with tools and objects rather than with ideas or people. These individuals are typically practical and physically oriented but may have difficulties expressing their feelings and concerns. They are less sociable and less given to interpersonal interactions. Such occupations as engineer, vocational agriculture teacher, and military officer are representative of this theme.

Individuals whose career interests are high in the investigative theme focus on science and scientific activities. They enjoy investigative challenges, particularly those that involve abstract problems and the physical world. They do not like situations that are highly structured, and may be quite original and creative in their ideas. They are typically intellectual, analytical, and often quite independent. Occupations such as biologist, mathematician, college professor, and psychologist are representative of this theme.

As the name implies, the artistic theme centers on artistic activities. Individuals with career interests in this area value aesthetics and prefer self-expression through painting, words, and other artistic media. These individuals see themselves as imaginative and original, expressive, and independent. Examples of specific careers that illustrate this theme are artist, musician, lawyer, and librarian.

The fourth area is the social area; individuals whose career interests fall under this theme are people-oriented. They are typically sociable and concerned about others. Their typical approach to problem solving is through interpersonal processes. Representative occupations here are guidance counselor, elementary school teacher, nurse, and minister.

The enterprising area is the area of sales. Individuals whose career interests are high here see themselves as confident and dominant, and like to be in charge and to persuade others. They make use of good verbal skills, are extroverted and adventurous, and prefer leadership roles. Typical occupations include store manager, purchasing agent, and personnel director.

Finally, the conventional theme focuses on the business world, especially those activities that characterize office work. Individuals whose career interests are high here are said to fit well in large organizations and to be comfortable working within a well-established chain of command, even though they do not seek leadership positions. Typically, they are practical and sociable, well controlled and conservative. Representative
occupations are those of accountant, secretary, computer operator, and credit manager.

As the description of these types indicates, Holland's model began its theoretical life as a personality model. Like other personality typologies that have been developed, it is understood that "pure" types are rare. But the different types are differentiated: A person who represents the "conventional" type is quite different from the person who is an "artistic" type.

Finally, there is a congruence between personality and occupation that results in satisfaction. An artistic type of person will most likely not find substantial satisfaction in being an accountant. Holland's theory is not the only theory of career development, but it has been one of the most influential, especially in terms of psychological testing (for other points of view see Bergland, 1974; Gelatt, 1967; Krumboltz, Mitchell, & Gelatt, 1975; Osipow, 1983; Tiedeman & O'Hara, 1963).

Reliability. The reliabilities associated with the Strong are quite substantial. D. P. Campbell and J. C. Hansen (1981), for example, cite median test-retest correlations of r = .91 with a 2-week interval; with a 2- to 5-year interval, the rs range from .70 to .78, and with a 20+ year interval, the rs range from .64 to .72. Not only is the Strong relatively stable over time, so are career interests.

Test-retest reliabilities for the Basic Interest Scales are quite substantial, with median coefficients of .91 for a 2-week interval, .88 for a 1-month interval, and .82 for 3-year intervals. Test-retest correlations also vary with the age of the sample, with the results showing less reliability with younger samples, for example 16-year-olds, as might be expected.

Validity. The Basic Interest Scales have substantial content and concurrent validity; that is, their content makes sense, and a number of studies have shown that these scales do indeed discriminate between persons in different occupations. In general, their predictive validity is not as high, and some scales seem to be related to variables other than occupational choice; for example, the Adventure Scale seems to reflect age, with older individuals scoring lower.

Strong was highly empirically oriented and developed not just an inventory, but a rich source of longitudinal data. For example, after the SVIB was published, he administered the inventory to the senior class at Stanford University, and 5 years later contacted them to determine which occupations they had entered and how these occupations related to their scores on the inventory.

The criterion, then, for studying the predictive validity of the Strong becomes the occupation that the person eventually enters. If someone becomes a physician and their Strong profile indicates a high score on the Physician scale, we then have a "hit." The problem, however, is that the world is complex and individuals do not necessarily end up in the occupation for which they are best suited, or which they desire. As Strong (1935) argued, if final occupational choice is an imperfect criterion, then a test that is validated against such a criterion must also be imperfect. This of course is precisely the problem we discussed in Chapter 3; a test cannot be more valid than the criterion against which it is matched, and in the real world there are few, if any, such criteria. Nevertheless, a number of studies both by Strong (1955) and others (e.g., D. P. Campbell, 1971; Dolliver, Irwin, & Bigley, 1972) show substantial predictive validity for the Strong, with a typical hit rate (agreement between a high score on an Occupational Scale and entrance into that occupation) of at least 50% for both men and women. There is of course something reassuring in the fact that the hit rates are not higher; for one thing, it means that specific occupations do attract people with different ideas and interests, and such variability keeps occupations vibrant and growing.

Faking. In most situations where the Strong is administered, there is little if any motivation to fake the results, because the client is usually taking the inventory for his or her own enhancement. There may be occasions, however, when the Strong is administered as part of an application process; there may be potential for faking in the application for a specific occupation or perhaps entrance into a professional school.

Over the years, a number of investigators have looked at this topic, primarily by administering the Strong twice to a sample of subjects, first under standard instructions and then with instructions to fake in a specific way, for example, "fake good to get higher scores on engineering" (e.g., Garry, 1953; Wallace, 1950). The
results basically support the notion that under such instructions Strong results can be changed. Most of these studies represent artificial situations where captive subjects are instructed to fake. What happens in real life? D. P. Campbell (1971) reports the results of a doctoral dissertation that compared the Strong profiles of 278 University of Minnesota males who had completed the Strong first for counseling purposes and later completed it a second time as part of their application to the University of Minnesota medical school. Presumably, when the Strong was taken for counseling purposes the respondents completed the inventory honestly, but when the Strong was taken as part of an application process, faking might have occurred, especially on those items possibly related to a career in medicine. In fact, for 47% of the sample, there was no difference in their Physician scale score between the two administrations. For 29%, there was an increase, but not a substantial one. For 24%, there was a substantial increase, enough to have a "serious effect" on its interpretation by an admissions officer. Of course, just because there was an increase does not mean that the individual faked; the increase might well reflect legitimate growth in medical interest. There are three points to be made here: (1) faking is possible on the Strong, (2) massive distortions do not usually occur, and (3) the resulting profile typically shows considerable consistency over time.

Inconsistencies. Because the Strong contains different sets of scales developed in different ways, it is not unusual for a client's results to reflect some inconsistencies. R. W. Johnson (1972) reported that some 20% of profiles have at least one or more such inconsistencies between Occupational Scales and Basic Interest Scales. D. P. Campbell and J. C. Hansen (1981) argue that such inconsistencies are meaningful and result in more accurate test interpretation, because they force both the counselor and the client to understand the meaning of the scales and to go beyond the mere occupational label. For example, the Basic Interest Scales reflect not only career interests but leisure interests as well (Cairo, 1979).

Unit weighting. The Strong nicely illustrates the concept of unit weights as opposed to variable weights. Let's suppose we are developing a scale for a new occupation of "virtual reality trainer" (VRT). We administer the Strong, which represents a pool of items and an "open" system, to a group of VRTs and a group of "people in general," and identify those items that statistically separate the two groups.

Let's say, for example, that 85% of our VRTs indicate like to the item "computer programmer" vs. only 10% for the general sample, and that 80% of the VRTs also indicate dislike to the item "philosopher" vs. 55% for the general sample. Both items show a significant difference in response pattern, and so both would be included in our scale. But clearly, one item is more "powerful" – one item shows a greater difference between our two groups – and so we might logically argue that such an item should be given greater weight in the way the scale is scored. That indeed is what Strong originally did; the items were weighted based on a ratio of the response percentage of the specific occupational sample vs. the response percentage of the general sample. And so initially, Strong items were scored with weights ranging from −30 to +30. Such scoring, especially in the precomputer days, was extremely cumbersome, and so it was simplified several times until in 1966 the weights of +1, 0, and −1 were adopted. Empirical studies of unit weights vs. variable weights show the unitary weights to be just as valid.

Percentage overlap. Another interesting concept illustrated by the Strong is that of percentage overlap. Let's assume we have administered the Strong to two groups of individuals, and we are interested in looking at a specific occupational scale for which our theory dictates the two samples should differ. How do we determine whether the two groups differ? Ordinarily we would carry out a t test or an analysis of variance to assess whether the means of the two groups are statistically different from each other (you recall, by the way, that when we have two groups, the two procedures are the same in that t² = F). Such a procedure tells us that yes (or no) there is a difference, but it doesn't really tell us how big that difference is, and it does not address the issue of practicality – a small mean difference could be statistically significant if we have large enough samples, but would not necessarily be useful.

A somewhat different approach was suggested by Tilton (1937), who presented the statistic of
FIGURE 6–3. Two distributions separated from each other by two standard deviations.

percent overlap, which is simply the percentage of scores in one sample that are matched by scores in the second sample. If the two distributions of scores are totally different and therefore don't overlap, the statistic is zero. If the two distributions are identical and completely overlap, then the statistic is 100%. If the intent of a scale is to distinguish between two groups, then clearly the lower the percentage overlap, the more efficient (valid) is the scale. Tilton called this statistic the Q index, and it is calculated as follows:

Q = (M1 − M2) / [(SD1 + SD2) / 2]

where M1 and M2 are the means of the two groups and SD1 and SD2 their standard deviations. Once Q is computed, the percent overlap can be determined using Tilton's (1937) table. Essentially, the Q index is a measure of the number of standard deviation units that separate the two distributions. For example, a Q value of 2 represents two distributions that are separated from each other by two standard deviations and have an overlap of about 32%. Figure 6.3 illustrates this. Note that if our occupational scale were an IQ test, the means of the two groups would differ by about 30 points – a rather substantial difference. The median percent overlap for the Strong occupational scales is in fact about 34%. This is of course a way of expressing concurrent validity. Scales that reflect well-defined occupations, such as physicist or chemist, have the lowest overlap, or highest validity. Scales that assess less well-defined occupations, such as that of college professor, have a higher degree of overlap and, therefore, lower concurrent validity.
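As a quick check on the formula, here are a few lines of Python; the numbers echo the IQ example just mentioned (two means 30 points apart, each SD = 15), and the percent overlap itself must still be looked up in Tilton's table:

def q_index(m1, sd1, m2, sd2):
    # Mean difference expressed in units of the average standard deviation
    return (m1 - m2) / ((sd1 + sd2) / 2)

print(q_index(m1=115, sd1=15, m2=85, sd2=15))  # 2.0, i.e., about 32% overlap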
Racial differences. Although racial differences on the Strong have not been studied extensively, and in fact the SVIB Handbook (D. P. Campbell, 1971) does not discuss this topic, the available studies (e.g., Barnette & McCall, 1964; Borgen & Harper, 1973) indicate that the Strong is not racially biased and that its predictive validity and other psychometric aspects for minority groups are equivalent to those for whites.

Item response distribution. D. P. Campbell and J. C. Hansen (1981) indicate that interest measurement is based on two empirical findings: (1) different people give different responses to the individual items; and (2) people who are satisfied with their particular occupation tend to respond to particular items in a characteristic way. Given these two statements, the item response distribution for a particular item charts the value of that item and its potential usefulness in the inventory. At the Center for Interest Measurement Research of the University of Minnesota, extensive data on the Strong are stored in computer archives, going back to the original samples tested by Strong. For example, D. P. Campbell and J. C. Hansen (1981) show the item response distribution for the item "artist" given by some 438 samples, each sample typically ranging from less than 100 to more than 1,000 individuals in a specific occupation. Both male and female artist samples tend to show near unanimity in their endorsement of "like"; at the other extreme, male farmers show an 11% "like" response, and females in life insurance sales a 32% "like" response.
Longitudinal studies. The Strong has been used in a number of longitudinal studies, and specifically to assess the stability of vocational interests within occupations over long time spans. D. P. Campbell (1966) asked and answered three basic questions: (1) Do Strong scales developed in the 1930s hold up in cross-validation years later? The answer is yes; (2) When Strong scales have been revised, did the revised scales differ drastically from the originals? The answer is not much; (3) Do the individuals of today who hold the same job as the individuals in Strong's criterion groups of the 1930s have the same interest patterns? The answer is pretty much so.

Inventoried vs. expressed interests. "Inventoried" interests are assessed by an inventory such as the Strong. "Expressed" interests refer to the client's direct comments such as, "I want to be an engineer" or "I am going to study environmental law." How do these two methods compare? If, for example, the Strong were simply to mirror the client's expressed interests, why waste time and money when the same information could be obtained more directly by simply asking the subject what they want to be? Of course, there are many people who do not know what career to pursue, and so one benefit of the Strong and similar instruments is that it provides substantial exploratory information. Berdie (1950) reported correlations of about .50 in studies that compared inventoried and expressed interests. However, Dolliver (1969) pointed out that this deceptively simple question actually involves some complex issues, including the reliability and validity of both the inventory and the method by which expressed interests are assessed, attrition of subjects upon follow-up, and the role of chance in assessing such results.

The Kuder Inventories

Introduction. A second set of career-interest inventories that have dominated psychological testing in this area has been the inventories developed by Frederic Kuder. There are actually three Kuder inventories: (1) the Kuder Vocational Preference Record (KVPR), which is used for career counseling of high-school students and adults; (2) the Kuder General Interest Survey (KGIS), which is a downward extension of the KVPR, for use with junior and senior high-school students in grades 6 through 12; and (3) the Kuder Occupational Interest Survey (KOIS), designed for grades 10 through adulthood. The first two yield scores in 10 general areas, namely: artistic, clerical, computational, literary, mechanical, musical, outdoor, persuasive, scientific, and social service. The third, the KOIS, yields substantially more information, and our discussion will focus primarily on this instrument. Once again, we use the term "Kuder" as a more generic designation, except where this would violate the meaning.

Development. Initially, the Strong and the Kuder represented very different approaches. The Strong reflected criterion-group scaling while the Kuder represented homogeneous scaling, that is, clustering of items that are related. Over the years, however, the two approaches have borrowed heavily from each other, and thus have become more convergent in approach and process.

Description. The KOIS takes about 30 minutes to complete and is not timed. It can be administered to one individual or to a large group at one sitting. Like the Strong, it too must be computer scored. The KOIS is applicable to high-school students in the 10th grade or beyond (Zytowski, 1981). In addition to 126 occupational scales, the KOIS also has 48 college-major scales. The KOIS also has a number of validity indices, similar to the Strong's administrative indices, including an index that reflects the number of items left blank, and a verification score that is basically a "fake good" scale. As with most other major commercially published tests, there is not only a manual (Kuder & Diamond, 1979), but additional materials available for the practitioner (e.g., Zytowski, 1981; 1985).

Scale development. We saw that in the Strong, occupational scales were developed by pooling those 40 to 60 items in which the response proportions of an occupational group and an in-general group differed, usually by at least 16%. The Kuder took a different approach. The Kuder was originally developed by administering a list of statements to a group of college students and, based on their responses, placing the items into 10 homogeneous scales. Items within a scale correlated highly with each other, but not with items
in the other scales. Items were then placed in triads, each triad reflecting three different scales. The respondent indicates which item is most preferred and which item is least preferred. Note that this results in an ipsative instrument – one cannot obtain all high scores. To the degree that one scale score is high, the other scale scores must be lower.

Let's assume we are developing a new occupational scale on the Kuder for "limousine driver," and find that our sample of limousine drivers endorses the first triad as follows:

Item #   Most preferred   Least preferred
  1           20%              70%
  2           60%              15%
  3           20%              15%

That is, 20% of our sample selected item #1 as most preferred, 60% selected item #2, and 20% item #3; similarly, 70% selected item #1 as least preferred, 15% item #2, and 15% item #3.

If you were to take the Kuder, your score for the first triad on the "limousine driver" scale would be the proportion of the criterion group that endorsed the same responses. So if you indicated that item #1 is your most preferred and item #2 is your least preferred, your score on that triad would be .20 + .15 = .35. The highest score would be obtained if you endorsed item #2 as most and item #1 as least; your score in this case would be .60 + .70 = 1.30. Note that this triad would be scored differently for different scales because the proportions of endorsement would presumably change for different occupational groups. Note also that with this approach there is no need to have a general group.

In any one occupational group, we would expect a response pattern that reflects homogeneity of interest, as in our fictitious example, where the majority of limousine drivers agree on what they prefer most and prefer least. If we did not have such unanimity, we would expect a "random" response pattern, where each item in the triad is endorsed by approximately one-third of the respondents. In fact, we can calculate a total score across all triads that reflects the homogeneity of interest for a particular group, and whether a particular Kuder scale differentiates one occupational group from others (see Zytowski & Kuder, 1986, on how this is done).
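A minimal sketch in Python of this triad-scoring logic, using the hypothetical limousine-driver proportions from the table above:

# Proportions of the criterion group endorsing each item in the triad.
most_preferred = {1: 0.20, 2: 0.60, 3: 0.20}
least_preferred = {1: 0.70, 2: 0.15, 3: 0.15}

def triad_score(most_item, least_item):
    # A respondent's score is the proportion of the criterion group that
    # made the same "most" choice plus the proportion that made the
    # same "least" choice.
    return most_preferred[most_item] + least_preferred[least_item]

print(triad_score(1, 2))  # .20 + .15 = .35
print(triad_score(2, 1))  # .60 + .70 = 1.30, the maximum for this triad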
Scoring. Scales on the KOIS are scored by means of a "lambda" score, which is a modified biserial correlation coefficient and is essentially an index of similarity between a person's responses and the criterion group for each scale. Rather than interpreting these lambda scores directly, they are used to rank order the scales to show the magnitude of similarity. Thus, the profile sheet that summarizes the test results is essentially a listing of general occupational interests (e.g., scientific, artistic, computational), and of occupations and of college majors, all listed in decreasing order of similarity.

Reliability. Test-retest reliabilities seem quite acceptable, with, for example, median reliability coefficients in the .80s over both a 2-week and a 3-year period (Zytowski, 1985).

Validity. Predictive validity also seems to be acceptable, with a number of studies showing about a 50% congruence between test results and subsequent entrance into an occupation some 12 to 19 years later (Zytowski & Laing, 1978).

Other Interest Inventories

A large number of other inventories have been developed over the years, although none have reached the status of the Strong or the Kuder. Among the more popular ones are the Holland Self-Directed Search (Holland, 1985b), the Jackson Vocational Interest Survey (D. N. Jackson, 1977), the Career Assessment Inventory (Johansson, 1975), the Unisex edition of the ACT Interest Inventory (Lamb & Prediger, 1981), and the Vocational Interest Inventory (P. W. Lunneborg, 1979).

Interest inventories for disadvantaged. A number of interest inventories have been developed for use with clients who, for a variety of reasons, may not be able to understand and/or respond appropriately to verbal items such as those used in the Strong and the Kuder. These inventories, such as the Wide Range Interest Opinion Test (J. F. Jastak & S. R. Jastak, 1972) and the Geist Picture Interest Inventory (Geist, 1959), use drawings of people or activities related to occupational tasks such as doing laundry, taking care of animals, serving food, and similar activities. Some of these inventories use a forced-choice format, while others ask the respondent to indicate how much they like each activity. Most of these have adequate reliability but leave much to be desired in the area of validity.

Interest inventories for nonprofessional occupations. Both the Strong and the Kuder have found their primary application with college-bound students and adults whose expectations are to enter professions. In fact, most career-interest inventories are designed for occupations that are entered by middle-class individuals. In large part, this reflects a reality of our culture that perhaps is changing. At least in the past, individuals from lower socioeconomic classes did not have much choice, and job selection was often a matter of availability and financial need. For upper socioeconomic class individuals, their choice was similarly limited by family expectations and traditions, such as continuing a family business or family involvement in government service. A number of other interest inventories have been developed that are geared more for individuals entering nonprofessional occupations.

One example of such inventories is the Career Assessment Inventory (CAI; Johansson, 1986), first introduced in 1975 and subsequently revised several times, recently to include both nonprofessional and professional occupations. The CAI currently contains some 370 items similar in content to those of the Strong and takes about 40 to 45 minutes to complete. For each item, the client responds on a 5-point scale ranging from "like very much" to "dislike very much." Like the Strong, the CAI contains the six general-theme scales that reflect Holland's typology, 25 Basic Interest scales (e.g., electronics, food service, athletics-sports), and 111 occupational scales (such as accountant, barber/hairstylist, carpenter, fire-fighter, interior designer, medical assistant, police officer, and truck driver). Although the CAI seems promising, in that it was well-designed psychometrically and shows adequate reliability, it too has been criticized, primarily for lack of evidence of validity (McCabe, 1985; Rounds, 1989).

Other career aspects. In addition to career interests, there are a number of questionnaires designed to assess a person's attitudes, competencies, and decision-making skills, all in relation to career choice. For example, the Career Decision Scale (Osipow, 1987) is an 18-item scale designed to assess career indecision in college students, and the Career Maturity Inventory (Crites, 1978) is designed to assess career-choice competencies (such as self-knowledge and awareness of one's interests) and attitudes (such as degree of independence and involvement in making career decisions).

Lack of theory. One criticism of the entire field of career-interest measurement is that it has been dominated by an empirical approach. The approach has been highly successful, yet it has resulted in a severe lack of theoretical knowledge about various aspects of career interests. For example, how do these interests develop? What psychological processes mediate and affect such interests? How do such variables as personality, temperament, and motivation relate to career interests? There is now a need to focus on construct validity rather than criterion validity. To be sure, such questions have not been totally disregarded. For example, Roe (Roe & Klos, 1969; Roe & Siegelman, 1964) felt that career choices reflected early upbringing and that children who were raised in an accepting and warm family atmosphere would choose people-oriented occupations. Others have looked to a genetic component of career interests (e.g., Grotevant, Scarr, & Weinberg, 1977).

New occupational scales. The world of work is not a static one, and especially in a rapidly expanding technology, new occupations are created. Should we therefore continue to develop new occupational scales? Some authors (e.g., Borgen, 1986; Burisch, 1984) have argued that a simple, deductive approach to career-interest measurement may now be more productive than the empirical and technical development of new scales. These authors believe that we now have both the theories and the empirical knowledge related to the occupational world, and we should be able to locate any new occupation in that framework without needing to go out and develop a new scale. Indeed, Borgen (1986) argues that occupational scales may not be needed and that a broad perspective, such as
the one provided by Holland's theory, is all that is needed.

SUMMARY

We have looked at the measurement of attitudes, values, and interests. From a psychometric point of view these three areas share much in common, and what has been covered under one topic could, in many instances, be covered under a different topic. We looked at four classical methods to construct attitude scales: the method of equal-appearing intervals or Thurstone method, the method of summated ratings or Likert method, the Bogardus social distance scale, and Guttman scaling. In addition, we looked at the Semantic Differential, checklists, numerical and graphic rating scales, and self-anchoring scales.

In the area of values, we looked at the Study of Values, a measure that enjoyed a great deal of popularity years ago, and the Rokeach Value Survey, which is quite popular now. We also briefly discussed the Survey of Interpersonal Values and the Survey of Personal Values to illustrate another approach. In the area of career interests, we focused primarily on the Strong and the Kuder inventories, which originally represented quite different approaches but in recent revisions have become more alike.

SUGGESTED READINGS

Campbell, D. P. (1971). An informal history of the SVIB. In Handbook for the Strong Vocational Interest Blank (pp. 343-365). Stanford, CA: Stanford University Press.

This is a fascinating account of the SVIB, from its early beginnings in the 1920s to the mid 1960s, shortly after Strong's death. For those who assume that computers have been available "forever," this chapter has a wonderful description of the challenges required to "machine score" a test.

Domino, G., Gibson, L., Poling, S., & Westlake, L. (1980). Students' attitudes towards suicide. Social Psychiatry, 15, 127-130.

The investigators looked at the attitudes that college students have toward suicide. They used the Suicide Opinion Questionnaire, and administered it to some 800 college students in nine different institutions. An interesting study illustrating the practical application of an attitude scale.

Kilpatrick, F. P. & Cantril, H. (1960). Self-anchoring scaling: A measure of individual's unique reality worlds. Journal of Individual Psychology, 16, 158-173.

An interesting report where the two authors present the self-anchoring methodology and the results of several studies where such scales were administered to adult Americans, legislators from seven different countries, college students in India, and members of the Bantu tribe in South Africa.

Lawton, M. P. & Brody, E. M. (1969). Assessment of older people: Self-maintaining and instrumental activities of daily living. The Gerontologist, 9, 179-186.

Two scales are presented for use with institutionalized elderly. The first scale focuses on physical self-maintenance and covers six areas, including toilet use, feeding, dressing, grooming, and physical ambulation. The second scale, Instrumental Activities of Daily Living, covers eight areas ranging from the ability to use the telephone to the ability to handle finances. The scale items are given in this article and clearly illustrate the nature of Guttman scaling, although the focus is on how the scales can be used in various settings, rather than in how the scales were developed.

Rokeach, M. & Ball-Rokeach, S. J. (1989). Stability and change in American value priorities, 1968-1981. American Psychologist, 44, 775-784.

Psychological tests can be useful not only to study the functioning of individuals, but to assess an entire society. In this report, the authors analyze national data on the RVS which was administered by the National Opinion Research Center of the University of Chicago in 1968 and again in 1971, and by the Institute for Social Research at the University of Michigan in 1974 and in 1981. Although there seems to be remarkable stability of values over time, there were also some significant changes – for example, "equality" decreased significantly.

DISCUSSION QUESTIONS

1. What are some of the strengths and weaknesses of attitude scales?
2. Which of the various ways of assessing reliability would be most appropriate for a Guttman scale?
3. What might be some good bipolar adjectives to use in a Semantic Differential scale to rate "my best teacher"?
4. What are the basic values important to college students today? Are these included in the Rokeach?
5. Most students take some type of career-interest test in high school. What is your recollection of such a test and the results?
7 Psychopathology

AIM In this chapter we look at testing as applied to psychopathology. We briefly cover some issues of definition and nosology, and then we look at 11 different instruments, each selected for specific purposes. First, we look at two screening inventories, the SCL-90R and the PSI. Then we look at three multivariate instruments: two of these, the MMPI and the MCMI, are major instruments well known to most clinicians, and the third is new and unknown. Next, we look at an example of a test that focuses on a specific aspect of psychopathology – the schizoid personality disorder. Finally, we look at a measure of anxiety and three measures of depression used quite frequently by clinicians and researchers. More important than each specific test are the kinds of issues and construction methodologies they represent, especially in relation to the basic issues covered in Chapter 3.

INTRODUCTION

The Diagnostic and Statistical Manual (DSM). As you are well aware, there is a wide range of physical illnesses that can affect humans. These illnesses are classified, under various headings, in the International Classification of Diseases, a sort of dictionary of illnesses which also gives each illness a particular classificatory number. Thus physicians, clinics, insurance companies, governmental agencies, etc., all over the world, have a uniform system for reporting and classifying illnesses.

A similar approach applies to mental illnesses, and here the classificatory schema is called the Diagnostic and Statistical Manual of Mental Disorders, or DSM for short. The first DSM was published in 1952, and revisions are made as needed. This classificatory schema is thus not static but is continually undergoing study and revision. In the DSM, each diagnostic label is numbered, and these numbers coincide with those used in the International Classification of Diseases. Thus, for example, if you were to look for pyromania (fire-setting) in the DSM you would find it listed under "Impulse Control Disorders (not classified elsewhere)," and the number assigned is 312.33. The DSM is based on a medical model of behavioral and emotional problems, which views such disturbances as illnesses. Part of the model is to perceive these problems as residing within the individual, rather than as the result and interplay of environmental, familial, or cultural aspects.

Later revisions of the DSM use a multiaxial approach – that is, individuals are classified according to five different axes or dimensions. The first dimension refers to clinical syndromes. The second dimension pertains to developmental disorders and personality disorders. Axis III refers to physical disorders and conditions, axis IV to severity of psychosocial stressors, and axis V to level of adaptive functioning. The DSM is a guide rather than a cookbook; it is intended to assist the clinician in making a diagnosis and in communicating with other professionals (for a reliability and validity analysis of the DSM-IV, see Nelson-Gray, 1991).
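As an illustration of this multiaxial format, a hypothetical record might be represented as follows (the entries are invented for illustration; only the pyromania code, 312.33, comes from the text):

multiaxial_diagnosis = {
    "Axis I":   ["312.33 Pyromania"],  # clinical syndromes
    "Axis II":  [],                    # developmental and personality disorders
    "Axis III": ["none reported"],     # physical disorders and conditions
    "Axis IV":  "moderate",            # severity of psychosocial stressors
    "Axis V":   "good",                # level of adaptive functioning
}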
Mental disorder. There are many phrases used for the area under consideration, such as "mental illness," "psychopathology," and so on. For our purposes, we will follow the DSM approach and define mental disorder as a psychological pattern associated with distress and/or impairment of functioning that reflects dysfunction in the person.

Psychiatric diagnosis. In the assessment of psychopathology, psychiatric diagnosis has occupied a central position, and often it has become the criterion against which any measuring instrument was matched. In the past, however, psychiatric diagnosis was seen as highly unreliable and was even described as a hopeless undertaking. The development of the DSM with its well-defined and specified criteria, as well as the publication of several structured and semistructured interviews, provided clinicians with the needed tools, so that today diagnostic unreliability is much less of a concern.

Diagnostic classification. Tests are often used to provide a diagnosis for a particular patient or to classify that patient in some way. Traditional psychiatric systems of diagnosis, and indeed the very process of classification, have been severely criticized over the years (e.g., Kauffman, 1989), but classification is both needed and useful. A diagnosis is a shorthand description that allows professionals to communicate, to match services to the needs of the patient, to further understand particular syndromes, and to serve as a statistical reporting device for governmental action, such as allowing funds to be used for the benefit of a client.

Differential diagnosis. Often diagnosis involves the assignment of one of several potentially applicable labels. Diagnoses are often quite difficult to make, particularly in psychological conditions where the indicators of a particular syndrome may not be clearly differentiable from the indicators of another syndrome. Test information can be quite useful in such situations, and the utility of a test can be judged in part from its differential diagnostic validity.

Assessment. Basically there are four ways we can assess an individual. We can ask that person directly (Do you hear voices?). Here we might use an interview or a self-report inventory. A second way is to ask someone else who knows the person (Does your husband hear voices?). Here we can interview spouses, parents, teachers, etc., and/or ask them to complete some rating scale. A third way is to observe the person in their natural environment. This might include home visits, observing a patient interact with others on the ward, or observing children playing in the playground. Finally, we can observe the person in a standardized-test situation. This is probably the most common method and involves tests such as those discussed below. For a compendium of scales that measure specific aspects of psychopathology, following the DSM system, see Schutte and Malouff (1995).

Cost factors. Testing is expensive whether it is done by an in-house professional or an outside consultant. Expensive means not only money that might be associated with the cost of the test itself and the salary of the associated personnel, but also the time commitment involved on the part of the client and staff. That is why, in part, tests that are paper-and-pencil, brief, comprehensive, self-administered, and objectively scored are often preferred by clinicians.

Given these economic concerns, tests are not ordinarily used in a routine fashion but are used only when their potential contribution is substantially greater than their cost. In addition, we need to consider the complexity of the issues concerning potential testing. For example, if a decision needs to be made about placing a client in a long-term institutional setting, as much information as possible needs to be gathered, and test results can be useful, especially when they can provide new information not readily available by other means. Also, tests of psychopathology are often used when there are difficult diagnostic questions about a client, rather than routine "textbook" type decisions. In a sense, the use of psychological tests parallels the use of medical tests – if you have a simple cold, you would not expect your physician to carry out brain scans, spinal taps, or other sophisticated, costly, and invasive procedures.

The use of test batteries. Although tests are used for a variety of purposes in the area of psychopathology, their use often falls into one of two
categories: (1) a need to answer a very specific and focused diagnostic question (e.g., does this patient represent a suicide risk?); or (2) a need to portray in a very broad way the client's psychodynamics, psychological functioning, and personality structure. The answer to the first category can sometimes be given by using a very specific, focused test – in the example given above, perhaps a scale of suicidal ideation. For the second category, the answer is provided either by a multivariate instrument, like the MMPI, or a test battery, a group of tests chosen by the clinician to provide potential answers.

Sometimes test batteries are routinely administered to new clients in a setting for research purposes, for evaluation of the effectiveness of specific therapeutic programs, or to have a uniform set of data on all clients so that base rates, diagnostic questions, and other aspects can be determined. The use of a test battery has a number of advantages other than simply an increased number of tests. For one, differences in performance on different tests may have diagnostic significance (a notion similar to scatter discussed in Chapter 5). If we consider test results as indicators of potential hypotheses (e.g., this client seems to have difficulties solving problems that require spatial reasoning), then the clinician can look for supporting evidence among the variety of test results obtained.

The Mental Status Exam (MSE). Traditionally, psychiatric diagnosis is based on the MSE, which is a psychiatrist's analogue to the general physical exam used by physicians. Like a medical exam, the MSE is not rigorously standardized and is highly subjective. Basically the MSE consists of an interview, during which the psychiatrist observes the patient and asks a number of questions. The psychiatrist conducting the MSE tries to obtain information on the client's level of functioning in about 10 areas:

1. Appearance: How does the client look? Is the client's appearance appropriate given his or her age, social position, educational background, etc.? Is the client reasonably groomed?
2. Behavior: What is the client's behavior during the MSE? This includes verbal behavior such as tone of voice, general flow, vocabulary, etc., and nonverbal behavior such as posture, eye contact, facial expressions, mannerisms, etc. Does the client act in bizarre or suspicious ways?
3. Orientation: Does the client know who he is, where he is, the time (year, month, day), and why he is there?
4. Memory: Typically divided into immediate (ability to recall within 10 seconds of presentation), recent (within the recent past, a few days to a few months), and remote memory (such as past employment, family deaths).
5. Sensorium: The degree of intactness of the senses (such as vision and touch), as well as the general ability to attend and concentrate.
6. Mood and Affect: Mood refers to the general or prevailing emotion displayed during the MSE, while affect refers to the range of emotions manifested during the MSE.
7. Intellectual functioning: The client's verbal ability, general fund of information, ability to interpret abstract proverbs, etc.
8. Perceptual processes: Veridical perception of the world vs. hallucinations.
9. Thought content: The client's own ideas about current difficulties; presence of persecutory delusions, obsessions, phobias, etc.
10. Thought process, also known as stream of consciousness: An assessment of the language process as it reflects the underlying thought processes, for example paucity of ideas, giving a lot of irrelevant detail, getting sidetracked, degree of insight shown by the client as to the nature of his or her problems, etc.

The above outline is not rigidly followed, but represents the kind of information that would be elicited during the course of an interview. Although the MSE is widely used, it is not a well-standardized procedure and quite a few variations exist (see Crary & C. W. Johnson, 1975; W. R. Johnson, 1981; Kahn, Goldfarb, Pollack et al., 1960; Maloney & Ward, 1976; Rosenbaum & Beebe, 1975).

MEASURES

The Structured Clinical Interview for DSM-III (SCID)

Brief mention should be made of the SCID because it is an outgrowth of the DSM and is of
importance to the assessment of psychopathology. The SCID is a semistructured interview, one of the first to be specifically designed on the basis of DSM-III criteria for mental disorders (Spitzer & Williams, 1983), and subsequently updated to reflect the changes in the revisions of the DSM (Spitzer, Williams, Gibbon, et al., 1992).

The SCID covers nine diagnostic areas including psychotic disorders, mood disorders, anxiety disorders, and eating disorders. The SCID should be administered by trained interviewers who have a background in psychopathology and are familiar with DSM criteria. However, Segal, Hersen, Van Hasselt, et al. (1993), in a study of elderly patients, used master's level graduate students to administer the SCID. These authors argued that in a typical agency it is most likely the less experienced clinicians who do the diagnostic work. In fact, their results showed an interrater agreement rate greater than 85%. The reliability of any structured interview is not a static number, but changes from study to study because it is affected by many variables such as aspects of the interviewers and of the subjects, as well as the reliability of the specific diagnostic criteria (Segal, Hersen, Van Hasselt, et al., 1994; J. B. W. Williams et al., 1992).

There is a shortened version of the SCID for use in settings where psychotic disorders are rare, and a form designed for use with nonpsychiatric patients, such as might be needed in community surveys of mental health. As with other major instruments, there are user's guides and computer scoring programs available.

The Symptom Checklist 90R (SCL-90R)

Introduction. The SCL-90R is probably one of the most commonly used screening inventories for the assessment of psychological difficulties. The SCL-90R evolved from some prior checklists called the Hopkins Symptom Checklist and the Cornell Medical Index (Wider, 1948). As its title indicates, the SCL-90R is a self-report inventory of symptoms, covering nine psychiatric categories such as depression and paranoid ideation, and focusing particularly on those symptoms exhibited by psychiatric patients and, to a lesser extent, by medical patients. A preliminary form was developed in 1973, and a revised form in 1976; in addition there is a brief form available and two forms for use by clinical observers (Derogatis, 1977).

Development. As indicated, the SCL-90R evolved from other checklists of symptoms that reflected years of diagnostic observations on the part of many clinicians. Factor analytic studies of the Hopkins Symptom Checklist identified five primary symptom dimensions. Four additional, rationally developed, symptom dimensions were added and the result was the SCL-90. An attempt was made to use simple phrases as the checklist items, and to keep the general vocabulary level as simple as possible.

Description. The patient is asked to indicate, for each of the 90 items, the degree of distress experienced during the past 7 days, using a 5-point scale (0 to 4) that ranges from "not at all" to "extremely." The SCL-90R can be scored for nine symptom dimensions, and these are presented in Table 7.1. The SCL-90R is primarily applicable to adults, but can be used with adolescents, perhaps even with 13- and 14-year-olds.

Administration. The SCL-90R contains clear directions and is relatively easy to administer, often by a nurse, technician, or research assistant. Most patients can complete the SCL-90R in 10 to 15 minutes. The items on the scale can also be read to the patient, in cases where trauma or other conditions do not permit standard administration. A 3 × 5 card with the response options is given to the patient, and the patient can indicate a response by pointing or raising an appropriate number of fingers.

Scoring. Raw scores are calculated by adding the responses for each symptom dimension, and dividing by the number of items in that dimension. In addition to the nine scales, there are three global indices that are computed. The Global Severity Index (GSI) is the sum of all the nonzero responses, divided by 90, and reflects both the number of symptoms endorsed and the intensity of perceived distress. The Positive Symptom Total (PST) is defined as the number of symptoms out of the 90 to which the patient indicates a nonzero response. This is a measure of the number of symptoms endorsed. The Positive Symptom Distress Index (PSDI) is defined as the sum of the nonzero responses
divided by the PST; thus, this is a measure of "intensity" corrected for the number of symptoms.

Table 7-1. Scales on the SCL-90

No. of items  Name                       Definition                           Examples
12            Somatization               Distress arising from perceptions    Headaches; a lump in your
                                         of bodily dysfunction                throat
10            Obsessive-compulsive       Unwarranted but repetitive           Having to check and double
                                         thoughts, impulses, and actions      check what you do
 9            Interpersonal sensitivity  Feelings of personal inadequacy      Feeling shy; feeling inferior
                                         and inferiority                      to others
13            Depression                 Dysphoric mood and withdrawal        Crying easily; feeling blue
10            Anxiety                    Nervousness, tension, feelings       Trembling; feeling fearful
                                         of apprehension
 6            Hostility                  Anger                                Feeling easily annoyed;
                                                                              shouting/throwing things
 7            Phobic anxiety             Persistent and irrational fear       Feeling afraid of open spaces;
                                         response                             feeling uneasy in crowds
 6            Paranoid ideation          Suspiciousness, grandiosity, and     Feeling others are to blame for
                                         delusions                            one's troubles; feeling that most
                                                                              people can't be trusted
10            Psychoticism               Withdrawn, isolated lifestyle with   Someone controls my thoughts;
                                         symptoms of schizophrenia            I hear voices that others do not

Note: If you yourself are a bit "obsessive-compulsive," you will note that the above items only add up to 83. There are in fact 7 additional items on the SCL-90 that are considered to reflect clinically important symptoms, such as poor appetite and trouble falling asleep, that are not subsumed under any of the 9 primary dimensions.
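Following the scoring rules described above, a minimal sketch in Python of the three global indices (the protocol shown is hypothetical):

def global_indices(responses):
    # responses: the 90 item ratings, each on the 0-4 distress scale.
    assert len(responses) == 90
    total = sum(responses)                    # sum of the (nonzero) responses
    pst = sum(1 for r in responses if r > 0)  # number of symptoms endorsed
    gsi = total / 90                          # Global Severity Index
    psdi = total / pst if pst else 0.0        # intensity per endorsed symptom
    return gsi, pst, psdi

# Thirty items rated 2, the rest 0:
gsi, pst, psdi = global_indices([2] * 30 + [0] * 60)
# gsi = 0.67, pst = 30, psdi = 2.0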

Reliability. The test manual (Derogatis, 1977) provides internal consistency (coefficient alpha) and test-retest (1 week) information. Internal consistency coefficients range from .77 to .90, with most in the mid .80s. Test-retest coefficients range from .78 to .90, again with most coefficients in the mid .80s. Thus from both aspects of reliability, the SCL-90R seems quite adequate.

Validity. The test manual discusses validity findings under three headings: concurrent, discriminative, and construct validity. Under concurrent validity are studies that compare SCL-90R scores with those obtained on other multivariate measures of psychopathology such as the MMPI. These results indicate that the SCL-90R scales correlate .40 to .60 with their counterparts on the MMPI, but the results are somewhat complex. For example, the Psychoticism Scale of the SCL-90R does correlate significantly with the MMPI Schizophrenia Scale (.64), but so does the SCL-90 Depression Scale (.55), the Obsessive-Compulsive Scale (.57), the Anxiety Scale (.51), and the Interpersonal Sensitivity Scale (.53)!

Under "discriminative" validity are studies where the SCL-90R is said to provide "clinical discrimination" or usefulness in dealing with different diagnostic groups. Thus studies are included in this section that deal with oncology patients, drug addicts, dropouts from West Point Military Academy, and others. Finally, there is a section on construct validity that presents evidence from factor-analytic studies showing a relatively good match between the original dimensions on the SCL-90R and the results of various factor-analytic procedures. The SCL-90R has been particularly useful in studies of depression (e.g., Weissman, Sholomskas, Pottenger, et al., 1977).

Factorial invariance. How stable are the 9 dimensions in various groups that may differ from each other on selected aspects, such as gender, age, or psychiatric status? In one sense, this is a question of generalizability and can be
viewed with respect to both reliability and validity. Couched in somewhat different language, it is also the question of factorial invariance – does the factor pattern remain constant across different groups? Such constancy, or lack of it, can be viewed as good or bad, depending on one's point of view. Having factorial invariance might indicate stability of dimensions across different groups; not having such invariance may in fact reflect the real world. For example, in psychiatric patients we might expect the various SCL-90R dimensions to be separate symptom entities, but in normal subjects it might not be surprising to have these dimensions collapse into a general adjustment type of factor. The SCL-90R manual suggests that the SCL-90R has such factorial invariance for gender, social class, and psychiatric diagnosis (see Cyr, Doxey, & Vigna, 1988).

Interpretation. The test manual indicates that interpretation of a protocol is best begun at the global level, with a study of the GSI, PST, and PSDI as indicators of overall distress. The next step is to analyze the nine primary symptom dimensions that provide a "broad-brush" profile of the patient's psychodynamic status. Finally, an analysis of the individual items can provide information as to whether the patient is suicidal or homicidal, what phobic symptoms are present, and so on.

Norms. Once the raw scores are calculated for a protocol, the raw scores can be changed to T scores by consulting the appropriate normative table in the manual. Such norms are available for male (n = 490) and female (n = 484) normal or nonpatient subjects, for male (n = 424) and female (n = 577) psychiatric outpatients, and for adolescent outpatients (n = 112). In addition, mean profiles are given for some 20 clinical samples ranging from psychiatric outpatients, alcoholics seeking treatment, to patients with sexual dysfunctions.

The Psychological Screening Inventory (PSI)

Introduction. The PSI (Lanyon, 1968) was designed as a relatively brief test to identify psychological abnormality, in particular as an aid to the identification of individuals who may require psychiatric hospitalization or criminal institutionalization. The PSI is not intended to be a diagnostic instrument but is a screening device to be used to detect persons who might receive more intensive attention. For example, a counselor or mental health worker might administer the PSI to a client to determine whether that client should be referred to a psychologist or psychiatrist for further assessment. The PSI consists of 130 true-false items that comprise five scales: two of the scales (Alienation and Social Nonconformity) were designed to identify individuals unable to function normally in society, and two of the scales (Discomfort and Expression) were developed to assess what the author considers major dimensions of personality; one scale (Defensiveness) was designed to assess "fake good" and "fake bad" tendencies. (A sixth scale, Infrequency or Random Response, was later added but is not scored on the profile sheet.)

According to the author (Lanyon, 1968), the Alienation Scale attempts to identify individuals we might expect to be patients in a psychiatric institution; the scale was developed empirically to differentiate between normal subjects and psychiatric patients. The Social Nonconformity Scale attempts to identify individuals we might expect to find in prison. This scale also was developed empirically, to differentiate between normal subjects and state-reformatory inmates. The Discomfort Scale is more typically a dimension that is labeled as neuroticism, general maladjustment, or anxiety, while the dimension addressed by the Expression Scale is often labeled extraversion or undercontrol. The names of the scales were chosen to be "nontechnical," so they are not the best labels that could have been selected.

Development. The PSI was developed by establishing a pool of items that were face valid and related to the five dimensions to be scaled. These items were brief, in about equal proportions of keyed true and keyed false, and written so as to minimize social desirability. The resulting 220 items were administered to a sample of 200 (100 male and 100 female) normal subjects chosen to be representative of the U.S. population in age and education and comparable in socioeconomic status and urban-rural residence. The Alienation
Scale, composed of 25 items, was developed by criterion groups' analyses between the normal sample and two samples (N = 144) of psychiatric patients, primarily schizophrenics; the intent of this scale is to indicate a respondent's similarity of response to those of hospitalized psychiatric patients. Thus this scale is essentially a scale of serious psychopathology. The Social Nonconformity Scale, also made up of 25 items, was developed by a parallel analysis between the normal group and 100 (50 male, 50 female) reformatory inmates; this scale then measures the similarity of response on the part of a client to those who have been jailed for antisocial behavior.

The Discomfort and Expression Scales, each with 30 items, were developed by internal consistency analyses of the items originally written for these scales, using the responses from the normal subjects only. You recall that this method involves correlating the responses on each item with the total scale score and retaining the items with the highest correlations. The Discomfort Scale measures susceptibility to anxiety, a lack of enjoyment of life, and the perceived presence of many psychological difficulties. The Expression Scale measures extraversion-introversion, so that high scorers tend to be sociable and extraverted, but also unreliable, impulsive, and undercontrolled; low scorers tend to be introverted and thorough, but also indecisive and overcontrolled.

Finally, the Defensiveness Scale, composed of 20 items, was developed by administering the pool of items to a sample of 100 normal subjects three times: once under normal instructions, once with instructions to "fake bad," and once with instructions to "fake good." Items that showed a significant response shift were retained for the scale. High scores therefore indicate that the subject is attempting to portray him- or herself in a favorable light, while low scores reflect a readiness to admit undesirable characteristics. This scale appears to be quite similar to the K scale of the MMPI. The above procedures yielded a 130-item inventory. A supplemental scale, Random Response (RA), was developed to assess the likelihood of random responding; this scale is analogous to the MMPI F scale.

Description. Most of the PSI items are of the kind one would expect on a personality test, with some clearly oriented toward pathology (such as hearing strange voices), but many are fairly normal in appearance, such as being healthy for one's age, or being extremely talkative. Versions of the PSI are available in various languages, such as Spanish and Japanese. Most inventories of this type use a separate question booklet and answer sheet. The basic advantage is that the test booklets can be reused, but having a separate answer sheet means the possibility of the client making mistakes, such as skipping one question and having all subsequent responses out of kilter. On the PSI, responses are marked right after each item, rather than on a separate answer sheet.

Administration. The PSI can be self-administered, used with one client or a large group. It takes about 15 minutes to complete, and less than 5 minutes to score and plot the profile.

Scoring. The PSI can easily be scored by hand through the use of templates, and the results easily plotted on a graph, separately by gender. The raw scores can be changed to T scores either by using the table in the manual or by plotting the scores on the profile sheet. On most multivariate instruments, like the MMPI for example, scores that are within two SDs from the mean (above 30 and below 70 in T scores) are considered "normal." On the PSI, scores that deviate by only 1 SD from the mean are considered significant. Thus one might expect a greater than typical number of false positives. However, the PSI manual explicitly instructs the user on how to determine a cutoff score, based on local norms, that will maximize the number of hits.

Reliability. Test-retest reliability coefficients range from .66 to .95, while internal consistency coefficients range from .51 to .85. These coefficients are based on normal college students and, as Golding (1978) points out, may be significantly different in a clinical population. Certainly the magnitude of these coefficients suggests that while the PSI is adequate for research purposes and group comparisons, extreme caution should be used in making individual decisions.
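To illustrate the raw-score-to-T-score conversion and the 1-SD rule described under Scoring, here is a minimal sketch (the normative mean and SD shown are hypothetical; the manual's gender-specific tables are the actual source):

def t_score(raw, norm_mean, norm_sd):
    # T scores have a mean of 50 and an SD of 10 in the normative group.
    return 50 + 10 * (raw - norm_mean) / norm_sd

t = t_score(raw=22, norm_mean=15.0, norm_sd=4.0)  # 67.5
flagged = abs(t - 50) >= 10  # True: deviates from the mean by 1 SD or more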
Validity. The initial validity data are based on comparisons of two small psychiatric cross-validation groups (Ns of 23 and 27), a prison group (N = 80), and subjects (N = 45; presumably college students) given the fake good and fake bad instructions. In each case, the mean scores on the scales seem to work as they should. For example, the Alienation means for the psychiatric patients are T scores ranging from 66 to 74 (remember that 50 is average), while Social Nonconformity means for the prison subjects are at 67 for both genders. Under instructions to fake good, the mean Defensiveness scores jump to 63 and 67, while under instructions to fake bad the mean scores go down to 31.

The test manual (Lanyon, 1973; 1978) indicates considerable convergent and discriminant validity by presenting correlations of the PSI scales with those of inventories such as the MMPI. Other studies are presented using contrasted groups. For example, in one study of 48 psychiatric patients and 305 normal males, a cutoff score of 60 on the Alienation Scale achieved an 86% overall hit rate, with 17% of the psychiatric cases misidentified as false negatives, and 14% of the normal individuals misidentified as false positives. Incidentally, the criterion for being "normal" is a difficult one – ordinarily it is defined either by self-report or by the absence of some criteria, such as the person has not been hospitalized psychiatrically or has not sought psychological help. These operational definitions do not guarantee that subjects identified as normal are indeed normal.

A number of studies can be found in the literature that also support the validity of the PSI. One example is that of Kantor, Walker, and Hays (1976), who administered the PSI to two samples of adolescents: 1,123 13- to 16-year-old students in a school setting and 105 students of the same age who were on juvenile probation. Students in the first sample were also asked to answer anonymously a questionnaire that contained three critical questions: (1) Did you run away from home last year? (2) Have you been in trouble with the police? (3) Have you stolen anything worth more than $2? A subgroup was then identified of youngsters who answered one or more of the critical questions in the keyed direction. The PSI scales did not differentiate the subgroup from the larger normal group, but the probationary group did score significantly higher on the Alienation, Discomfort, and Social Nonconformity Scales (the results on the Discomfort and Social Nonconformity Scales applied to females only).

Another example is the study by Mehryar, Hekmat, and Khajavi (1977), who administered the PSI to a sample of 467 undergraduate students, of whom 111 indicated that they had "seriously considered suicide." A comparison of the "suicidal" vs. nonsuicidal students indicated significant differences on four of the five PSI scales, with suicidal students scoring higher on Alienation, Social Nonconformity, and Discomfort, and lower on Defensiveness.

Many other studies present convergent and discriminant validity data by comparing the PSI with other multivariate instruments such as the MMPI and CPI, and the results generally support the validity of the PSI (Vieweg & Hedlund, 1984).

Norms. The initial norms were based on a sample of 500 normal males and 500 normal females, with scores expressed as T scores; thus separate norms by gender are given. These subjects came from four geographical states, and ranged in age from 16 to 60. Note that since the basic diagnostic question here is whether a subject is normal, the norms are based on normal subjects rather than psychiatric patients. Norms for 13- to 16-year-olds are presented by Kantor, Walker, and Hays (1976).

Factor analysis. J. H. Johnson and Overall (1973) did a factor analysis of the PSI responses of 150 introductory psychology college students. They obtained three factors, which they labeled as introversion, social maladjustment, and emotional maladjustment. These factors seem to parallel the PSI scales of Expression, Social Nonconformity, and a combination of Discomfort and Alienation. Notice that these subjects were college students, presumably well-functioning, normal individuals; the results might have been closer to the five dimensions postulated by Lanyon had the subjects been psychiatric patients. Nevertheless, J. H. Johnson and Overall (1973) concluded that their results supported the scoring procedure proposed by Lanyon (i.e., five scales).

Lanyon, J. H. Johnson, and Overall (1974) carried out a factor analysis of the 800 protocols
that represented the normative sample of normal adults, ages 16 to 60. The results yielded five factors. Factor I represents a dimension of serious psychopathology, and contains items from four PSI scales. Factor II represents an extraversion-introversion dimension, with most of the items coming from the Expression Scale. The third factor seems to be an acting-out dimension, although it is made up of few items and does not seem to be a robust dimension. Factor IV represents the "Protestant Ethic," defined as diligence in work and an attitude of responsibility; it too is made up of items from four of the PSI scales. Finally, Factor V is a general neuroticism factor, made up primarily of items from the Discomfort Scale. Note then, that two of the factors (extraversion and general neuroticism) parallel the two PSI scales that were developed on the basis of factor analysis. Two other factors (serious psychopathology and acting out) show some congruence to the Alienation and Social Nonconformity Scales, but the parallel is not very high.

Overall (1974) did administer the PSI to 126 new patients at a psychiatric outpatient clinic, and compared the results to those obtained on a sample of 800 normal subjects, originally collected as norms by the author of the PSI. What makes this study interesting and worth reporting here is that Overall (1974) scored each PSI protocol on the standard five scales, and then rescored each protocol on a set of five scales that resulted from a factor analysis (presumably those obtained in the Lanyon, J. H. Johnson, and Overall 1974 study cited above). Note that a factor analysis can yield the same number of dimensions as originally postulated by clinical insight, but the test items defining (or loading) each factor dimension may not be the exact ones found on the original scales. Overall (1974) computed a discriminant function (very much like a regression equation), which was as follows:

Y = .417Al + .034Sn + .244Di + .046Ex + .307De

For each person then, we would take their scores on the PSI, plug the values into the above equation, do the appropriate calculations, and compute Y. In this case, Y would be a number that would predict whether the individual is normal or a psychiatric patient. By using such an equation, first with the regular PSI-scale scores, and then with the factor-scale scores, we can ask which set of scales is more accurate in identifying group membership. In this study, use of the standard PSI-scale scores resulted in 17.5% of each group being misclassified – or 82.5% correct hits. Use of the factor scores resulted in 21.5% of each group being misclassified. Thus the standard PSI scale-scoring procedure resulted in a superior screening index.

Ethical-legal issues. Golding (1978) suggests that the use of the PSI can raise serious ethical-legal issues. If the PSI is used with patients who are seeking treatment, and the resulting scores are used for potential assignment to different treatments, there seems to be little problem, other than to ensure that whatever decisions are taken are made on the basis of all available data. But if the individual is unwilling to seek treatment and is nevertheless screened, for example in a military setting or by a university that routinely administers this to all incoming freshmen, and is then identified as a potential "misfit," serious ethical-legal issues arise. In fact, Bruch (1977) administered the PSI to all incoming freshmen over a 2-year period, as part of the regular freshman orientation program. Of the 1,815 students who completed the PSI, 377 were eventually seen in the Counseling Center. An analysis of the PSI scores indicated that students who became Counseling Center clients obtained higher scores on the Alienation, Social Nonconformity, and Discomfort Scales, and lower scores on the Defensiveness Scale.

Other criticisms. The PSI has been criticized on a number of grounds, including high intercorrelations among its scales and high correlations with measures of social desirability (Golding, 1978). Yet Pulliam (1975), for example, found support for the idea that the PSI scales are not strongly affected by social desirability. One study that used the PSI with adolescents in a juvenile court-related agency found the PSI to be of little practical use and with poor discriminant validity (Feazell, Quay, & Murray, 1991). For a review of the PSI, see Streiner (1985) and Vieweg and Hedlund (1984).
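Returning to Overall's (1974) discriminant function above, a minimal sketch in Python (the coefficients are from the text; the scale scores shown and any cutoff separating the two groups are not given there, so none are assumed):

# Al = Alienation, Sn = Social Nonconformity, Di = Discomfort,
# Ex = Expression, De = Defensiveness.
WEIGHTS = {"Al": .417, "Sn": .034, "Di": .244, "Ex": .046, "De": .307}

def discriminant_y(scores):
    # scores: the five PSI scale scores, keyed by scale abbreviation.
    return sum(WEIGHTS[scale] * scores[scale] for scale in WEIGHTS)

y = discriminant_y({"Al": 70, "Sn": 55, "Di": 60, "Ex": 50, "De": 45})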
THE MINNESOTA MULTIPHASIC PERSONALITY INVENTORY (MMPI) AND MMPI-2

The MMPI

Introduction. The MMPI was first published in 1943 and has probably had the greatest impact of any single test on the practice of psychology and on psychological testing. Its authors, a psychologist named Starke Hathaway and a psychiatrist, J. Charnley McKinley, were working at the University of Minnesota hospital and developed the MMPI for use in routine diagnostic assessment of psychiatric patients. Up to that time, diagnosis was based on the mental-status examination, but the psychiatrists who administered this were extremely overworked due to the large numbers of veterans returning from the battlefields of World War II who required psychiatric assistance including, as a first step, psychiatric diagnosis.

Criterion keying. Before the MMPI, a number of personality inventories had been developed, most of which were scored using a logical keying approach. That is, the author generated a set of items that were typically face valid ("Are you a happy person?") and scored according to the preconceived notions of the author; if the above item was part of an optimism scale, then a true response would yield 1 point for optimism. Hathaway and McKinley, however, chose to determine empirically whether an item was responded to differentially in two groups – a psychiatric group versus a "normal" group.

Development. A large pool of true-false items, close to 1,000, was first assembled. These items came from a wide variety of sources, including the mental-status exam, personality scales, textbook descriptions, case reports, and consultations with colleagues. This pool of items was then analyzed clinically and logically to delete duplications, vague statements, and so on. The remaining pool of some 500-plus items was then administered to groups of psychiatric patients who had diagnoses such as hypochondriasis, depression, paranoia, and schizophrenia. The pool of items was also administered to normal samples, primarily the relatives and visitors of the psychiatric patients. Scales were then formed by pooling together those items that differentiated significantly between a specific diagnostic group, normal subjects, and other diagnostic groups. Thus items for the Schizophrenia Scale included those for which the response rate of schizophrenics differed from those of normals and those of other diagnostic groups. Scales were then cross-validated by administering the scale to new samples, both normal and psychiatric patients, and determining whether the total score on the scale statistically differentiated the various samples.
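To make the logic of criterion keying concrete, here is a minimal sketch (in Python) of how one might flag items whose endorsement rate differentiates a diagnostic group from a normal group. The two-proportion z-test is one standard way to formalize the comparison; it is not necessarily the procedure Hathaway and McKinley used, and the endorsement rates below are invented for illustration.

from math import sqrt

def differentiates(p_clin, n_clin, p_norm, n_norm, z_crit=1.96):
    """Two-proportion z-test: does the rate of 'true' responses to an
    item differ between a clinical group and a normal group?"""
    p_pool = (p_clin * n_clin + p_norm * n_norm) / (n_clin + n_norm)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_clin + 1 / n_norm))
    if se == 0:
        return False
    return abs((p_clin - p_norm) / se) > z_crit

# Invented endorsement rates: (p_clinical, n_clinical, p_normal, n_normal)
items = {
    "item_12": (0.62, 50, 0.20, 700),  # endorsed far more often by patients
    "item_47": (0.33, 50, 0.30, 700),  # trivial difference
    "item_88": (0.05, 50, 0.41, 700),  # endorsed far less often by patients
}
keyed = [name for name, args in items.items() if differentiates(*args)]
print(keyed)  # ['item_12', 'item_88'] survive for the provisional scale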
Thus eight clinical scales, each addressed to a particular psychiatric diagnosis, were developed. Later, two additional scales were developed that became incorporated into the standard MMPI profile. These were: (1) a Masculinity-Femininity (Mf) Scale, originally designed to distinguish between homosexual and heterosexual males, but composed primarily of items showing a gender difference; and (2) a Social Introversion (Si) Scale, composed of items whose response rate differed in a group of college women who participated in many extracurricular activities vs. a group who participated in few if any such activities. This empirical approach to scale construction resulted in many items that were "subtle" – i.e., their manifest content is not directly related to the psychopathological dimension they are presumed to assess. This distinction between "subtle" and "obvious" items has become a major research focus; it has in fact been argued that such subtle items reduce the validity of the MMPI scales (Hollrah, Schlottmann, Scott, et al., 1995).

In addition to these 10 clinical scales, four other scales, called validity scales, were also developed. The purpose of these scales was to detect deviant test-taking attitudes. One scale is the Cannot Say Scale, which is simply the total number of items that are omitted (or in rare cases, answered as both true and false). Obviously, if many items are omitted, the scores on the rest of the scales will tend to be lower. A second validity scale is the Lie Scale, designed to assess faking good. The scale is composed of 15 items that most people, if they are honest, would not endorse – for example, "I read every editorial in the newspaper every day." These items are face valid and in fact were rationally derived.
A third validity scale, the F Scale, is composed of 60 items that fewer than 10% of the normal samples endorsed in a particular direction. These items cover a variety of content; a factor analysis of the original F Scale indicated some 19 content dimensions, such as poor physical health, hostility, and paranoid thinking (Comrey, 1958).

Finally, the fourth validity scale is called the K Scale and was designed to identify clinical defensiveness. The authors of this scale (Meehl & Hathaway, 1946) noticed that for some psychiatric patients, their MMPI profile was not as deviant as one might expect. They therefore selected 30 MMPI items for this scale that differentiated between a group of psychiatric patients whose MMPI profiles were normal (contrary to expectations) and a group of normal subjects whose MMPI profiles were also normal, as expected. A high K score was assumed to reflect defensiveness and hence lower the deviancy of the resulting MMPI profile. Therefore, the authors reasoned that the K score could be used as a correction factor to maximize the predictive validity of the other scales. Various statistical analyses indicated that this was the case, at least for 5 of the 10 scales, and so these scales are plotted on the profile sheet by adding to the raw score a specified proportion of the K raw score.
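The K correction itself is a purely mechanical step, as this minimal sketch (in Python) shows. The fractions used are the proportions commonly reported for the five K-corrected MMPI scales (.5K, .4K, 1.0K, 1.0K, .2K), but the raw scores are invented and the profile sheet itself remains the authoritative source.

# Fraction of the K raw score added to each K-corrected scale
K_FRACTION = {"Hs": 0.5, "Pd": 0.4, "Pt": 1.0, "Sc": 1.0, "Ma": 0.2}

def k_corrected(raw_scores, k_raw):
    """Return raw scores with the K correction added where applicable."""
    corrected = {}
    for scale, raw in raw_scores.items():
        # round to the nearest whole point, as on a hand-scored profile
        corrected[scale] = raw + round(K_FRACTION.get(scale, 0.0) * k_raw)
    return corrected

raw = {"Hs": 12, "D": 24, "Pd": 18, "Pt": 20, "Sc": 25, "Ma": 15}
print(k_corrected(raw, k_raw=14))
# Hs becomes 12 + round(.5 * 14) = 19; D carries no K correction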
Original aim. The original aim of the MMPI was to provide a diagnostic tool that could be group administered, that could save the valuable time of the psychiatrist, and that would result in a diagnostic label. Thus, if a patient scored high on the Schizophrenia Scale, that patient would be diagnosed as schizophrenic. The resulting MMPI, however, did not quite work this way. Depressed patients did tend to score high on the Depression Scale, but they also scored high on other scales. Similarly, some normal subjects also scored high on one or more of the clinical scales. It became readily apparent that many of the clinical scales were intercorrelated and that there was substantial item overlap between scales. Thus it would be unlikely for a patient to obtain a high score on only one scale. In addition, the psychiatric nosology, which in essence was the criterion against which the MMPI was validated, was rather unreliable. Finally, clinicians realized that the diagnostic label was not a particularly important piece of information. In medicine, of course, diagnosis is extremely important because a specific diagnosis is usually followed by specific therapeutic procedures. But in psychology, this medical model is limited and misleading. Although a diagnosis is shorthand for a particular constellation of symptoms, what is important is the etiology of the disorder and the resulting therapeutic regime – that is, how did the client get to be the way he or she is, and what can be done to change that. In psychopathology, the etiology is often complex, multidetermined, and open to different arguments, and the available therapies are often general and not target specific.

Because, in fact, reliable differences in MMPI scores were obtained between individuals who differed in important ways, the focus became what these differences meant. It became more important to understand the psychodynamic functioning of a client and that client's strengths and difficulties. The diagnostic names of the MMPI scales became less important, and in fact they were more or less replaced by a numbering system. As J. R. Graham (1993) states, each MMPI scale became an unknown to be studied and explored. Thousands of studies have now been carried out on the MMPI, and the clinician can use this wealth of data to develop a psychological portrait of the client and to generate hypotheses about the dynamic functioning of that client (see Caldwell, 2001).

Numerical designation. The 10 clinical scales are numbered 1 to 10 as follows:

Scale #   Original name
1         Hypochondriasis
2         Depression
3         Hysteria
4         Psychopathic Deviate
5         Masculinity-Femininity
6         Paranoia
7         Psychasthenia
8         Schizophrenia
9         Hypomania
10        Social Introversion

The convention is that the scale number is used instead of the original scale name. Thus a client scores high on scale 3, rather than on hysteria.
In addition, various systems of classifying MMPI profiles have been developed that use the number designations – thus a client's profile may be a "2 4 7."

The MMPI-2

Revision of MMPI. The MMPI is the most widely used personality test in the United States, and possibly in the world (Lubin, Larsen, & Matarazzo, 1984), but it has not been free of criticism. One major concern was that the original standardization sample, the 724 persons who were visitors to the University of Minnesota hospital, often to visit a relative hospitalized with a psychiatric diagnosis, was not representative of the U.S. general population.

In 1989, a revised edition, the MMPI-2, was published (Butcher, Dahlstrom, Graham, et al., 1989). This revision included a rewrite of many of the original items to eliminate wording that had become obsolete, sexist language, and items that were considered inappropriate or potentially offensive. Of the original 550 items found in the 1943 MMPI, 82 were rewritten, even though most of the changes were slight. Another 154 new items were added to adequately assess such aspects as drug abuse, suicide potential, and marital adjustment. An additional change was to obtain a normative sample that was more truly representative of the U.S. population. Potential subjects were solicited in a wide range of geographical locations, using 1980 Census data. The final sample consisted of 2,600 community subjects, 1,138 males and 1,462 females, including 841 couples, and representatives of minority groups. Other groups were also tested, including psychiatric patients, college students, and clients in marital counseling. To assess test-retest reliability, 111 female and 82 male subjects were retested about a week later.

At the same time that the adult version of the MMPI was being revised, a separate form for adolescents was also pilot tested, and a normative sample of adolescents also assessed. This effort, however, has been kept separate.

The resulting MMPI-2 includes 567 items and in many ways is not drastically different from its original. In fact, considerable effort was made to ensure continuity between the 1943 and the 1989 versions. Thus most of the research findings and clinical insights pertinent to the MMPI are still quite applicable to the MMPI-2. As with the MMPI, the MMPI-2 has found extensive applications in many cultures, not only in Europe and Latin America, but also in Asia and the Middle East (see Butcher, 1996).

New scales on the MMPI-2. The MMPI-2 contains three new scales, all of which are validity scales rather than clinical scales. One is the Backpage Infrequency Scale (Fb), which is similar to the F Scale but is made up of 40 items that occur later in the test booklet. The intent here is to assess whether the client begins to answer items randomly somewhere after the beginning of the test. A second scale is the Variable Response Inconsistency Scale (VRIN). This scale consists of 67 pairs of items that have either similar or opposite content, and the scoring of this scale reflects the number of item pairs that are answered inconsistently. A third scale is the True Response Inconsistency Scale (TRIN), which consists of 23 pairs of items that are opposite in content; the scoring, which is somewhat convoluted (see J. R. Graham, 1993), reflects the tendency to respond true or false indiscriminately.
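Scoring a scale like VRIN is essentially a counting exercise, as this minimal sketch (in Python) shows. The item pairs and keying below are invented for illustration; the actual VRIN pairs are part of the published scoring key.

# Each pair lists two item numbers and whether their content is
# "similar" (same answer expected) or "opposite" (different answers
# expected). These pairs are hypothetical.
PAIRS = [
    (3, 39, "similar"),
    (6, 90, "opposite"),
    (28, 59, "similar"),
]

def vrin_score(answers, pairs=PAIRS):
    """answers maps item number -> True/False response; returns the
    number of pairs answered inconsistently."""
    inconsistent = 0
    for a, b, relation in pairs:
        same = answers[a] == answers[b]
        if relation == "similar" and not same:
            inconsistent += 1
        elif relation == "opposite" and same:
            inconsistent += 1
    return inconsistent

answers = {3: True, 39: False, 6: True, 90: True, 28: False, 59: False}
print(vrin_score(answers))  # 2 inconsistent pairs in this invented example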
Administration. The MMPI-2 is easily administered and scored; it can be scored by hand using templates or by computer. It is, however, a highly sophisticated psychological procedure, and interpretation of the results requires a well-trained professional.

The MMPI-2 is appropriate for adolescents as young as age 13, but is primarily for adults. Understanding of the items requires a minimum eighth-grade reading level, and completing the test requires some 60 to 90 minutes on the average. Originally, the MMPI items were printed individually on cards that the client then sorted; subsequently, items were printed in a reusable booklet with a separate answer sheet, and that is the format currently used. The 1943 MMPI was also available in a form with a hard back that could be used as a temporary writing surface. The MMPI-2 is available for administration on a personal computer, as well as in a tape-recorded version for subjects who are visually handicapped. The MMPI and MMPI-2 have been translated into many languages.

There is a "shortened" version of the MMPI, in that in order to score the standard scales only the first 370 items need to be answered.
Subsequent items are either not scored or scored on special scales. Thus the MMPI, like the CPI, represents an "open" system where new scales can be developed as the need arises.

Scoring. Once the scales are hand scored, the raw scores are written on the profile, the K correction is added where appropriate, and the resulting raw scores are plotted. The scores for the 10 clinical scales are then connected by a line to yield a profile. The resulting profile "automatically" changes the raw scores into T scores. Of course, if the test is computer scored, all of this is done by the computer. There are in fact a number of computer services that can provide not only scoring but also rather extensive interpretative reports (see Chapter 17).
Uniform T scores. You recall that we can change raw scores to z scores and z scores to T scores simply by doing the appropriate arithmetic operations. Because the same operations are applied to every score, in transforming raw scores to T scores we do not change the shape of the underlying distribution of scores. As the distributions of raw scores on the various MMPI scales are not normally distributed, linear T scores are not equivalent from scale to scale. For one scale a T score of 70 may represent the 84th percentile, but for another scale a T score of 70 may represent the 88th percentile. To be sure, the differences are typically minor, but they can nevertheless be problematic. For the MMPI-2, a different kind of T transformation is used for the clinical scales (except scales 5 and 0); these are called uniform T scores, and they have the same percentile equivalent across scales (see Graham, 1993, for details).
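The linear transformation itself is straightforward, and a minimal sketch (in Python, with two invented normative distributions) shows why linear T scores need not match in percentile across scales when the raw-score distributions differ in shape.

from statistics import mean, pstdev

def linear_t(raw, norm_scores):
    """Linear T score: mean 50, SD 10, relative to a normative sample."""
    z = (raw - mean(norm_scores)) / pstdev(norm_scores)
    return 50 + 10 * z

def percentile(raw, norm_scores):
    """Percent of normative scores at or below this raw score."""
    return 100 * sum(s <= raw for s in norm_scores) / len(norm_scores)

scale_a = [8, 9, 10, 10, 11, 11, 12, 12, 13, 14]  # roughly symmetric
scale_b = [0, 0, 0, 0, 1, 1, 1, 2, 3, 12]         # markedly skewed

for scores in (scale_a, scale_b):
    raw_at_t60 = mean(scores) + pstdev(scores)    # raw score mapping to T = 60
    print(round(percentile(raw_at_t60, scores), 1))
# prints 80.0, then 90.0: the same linear T score of 60 sits at different
# percentiles on the two scales. Uniform T scores are constructed so that
# a given T value has the same percentile equivalent on every scale.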
underlined; if they are the same numerically, they
Interpretation. The first step is to determine are listed in the profile order, and underlined.
whether the obtained profile is a valid one. There To indicate the elevation of each scale, there is a
are a number of guidelines available on what cut- shorthand set of standard symbols. For example:
off scores to use on the various validity scales, but 6∗ 89”7/ etc. would indicate that the T score for
the determination is not a mechanical one, and scale 6 is between 90 and 99, the T scores for
requires a very sophisticated approach. A num- scales 8 and 9 are between 80 and 89 and are
ber of authors have developed additional scales to either identical or within 1 point of each other
detect invalid MMPI profiles, but whether these because they are underlined, and the T score for
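As a rough illustration of the ordering step, a sketch like the following (in Python, with invented T scores) lists the scales by magnitude and appends an elevation symbol after each group of scales falling in the same 10-point band. The * (90s), " (80s), and / (50s) symbols are confirmed by the example above; the remaining symbols follow the conventional Welsh notation as described in sources such as Graham (1993). The sketch ignores underlining and the validity scales.

# Elevation symbol for each T-score band, highest band first
BANDS = [(90, "*"), (80, '"'), (70, "'"), (60, "-"), (50, "/"), (40, ":")]

def welsh_code(t_scores):
    """Order scales by T score (largest first) and append the band
    symbol after each group of scales in the same 10-point range."""
    ordered = sorted(t_scores, key=lambda s: t_scores[s], reverse=True)
    code = ""
    for low, symbol in BANDS:
        in_band = [str(s) for s in ordered if low <= t_scores[s] < low + 10]
        if in_band:
            code += "".join(in_band) + symbol
    return code

# Invented profile (scale number -> T score)
t = {1: 58, 2: 71, 3: 65, 4: 92, 5: 50, 6: 84, 7: 55, 8: 86, 9: 62, 0: 45}
print(welsh_code(t))  # 4*86"2'39-175/0: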
Recently, the interest has focused on 2-scale or 3-scale groupings of profiles. Suppose, for example, we have a client whose highest MMPI scores occur on scales 4 and 8.
Table 7–2. MMPI-2 Clinical Scales

Scale (number of items) and description:

1. Hypochondriasis (Hs) (32): Designed to measure preoccupation with one's body (somatic concerns) and fear of illness. High scores reflect denial of good health, feelings of chronic fatigue, lack of energy, and sleep disturbances. High scorers are often complainers, self-centered, and cynical.

2. Depression (D) (57): Designed to measure depression. High scorers feel depressed and lack hope in the future. They may also be irritable and high-strung, have somatic complaints, lack self-confidence, and show withdrawal from social and interpersonal activities.

3. Hysteria (Hy) (60): This scale attempts to identify individuals who react to stress and responsibility by developing physical symptoms. High scorers are usually psychologically immature and self-centered. They are often interpersonally oriented, but are motivated by the affection and attention they get from others rather than by a genuine interest in other people.

4. Psychopathic Deviate (Pd) (50): The Pd type of person is characterized by asocial or amoral behavior such as excessive drinking, sexual promiscuity, stealing, and drug use. High scorers have difficulty incorporating the values of society and rebel toward authorities, including family members, teachers, and work supervisors. They often are impatient, impulsive, and poor planners.

5. Masculinity-Femininity (Mf) (56): Scores on this scale are related to intelligence, education, and socioeconomic status, and this scale, more than any other, seems to reflect personality interests rather than psychopathology. Males who score high (in the feminine direction) may have problems of sexual identity or a more androgynous orientation. Females who score high (in the masculine direction) reject the traditional female role and have interests that in our culture are seen as more masculine.

6. Paranoia (Pa) (40): Paranoia is marked by feelings of persecution, suspiciousness, grandiosity, and other evidence of disturbed thinking. In addition to these characteristics, high scorers may also be suspicious, hostile, and overly sensitive.

7. Psychasthenia (Pt) (48): Psychasthenic (a term no longer used) individuals are characterized by excessive doubts, psychological turmoil, and obsessive-compulsive aspects. High scorers are typically anxious and agitated individuals who worry a great deal and have difficulty concentrating. They are orderly and organized, but tend to be meticulous and overreactive.

8. Schizophrenia (Sc) (78): Schizophrenia is characterized by disturbances of thinking, mood, and behavior. High scorers, in addition to the psychotic symptoms found in schizophrenia, tend to report unusual thoughts, may show extremely poor judgment, and may engage in bizarre behavior.

9. Hypomania (Ma) (46): Hypomania is characterized by flight of ideas, accelerated motor activity and speech, and elevated mood. High scorers tend to exhibit an outward picture of confidence and poise, and are typically seen as sociable and outgoing. Underneath this facade there are feelings of anxiousness and nervousness, and their interpersonal relations are usually quite superficial.

10. Social Introversion (Si): High scorers are socially introverted and tend to feel uncomfortable and insecure in social situations. They tend to be shy, reserved, and lacking in self-confidence.

Note: Most of the above is based on Graham (1990) and Dahlstrom, Welsh, and Dahlstrom (1972).
By looking up profile 48 in one of several sources (e.g., W. G. Dahlstrom, Welsh, & L. E. Dahlstrom, 1972; Gilberstadt & Duker, 1965; J. R. Graham, 1993; P. A. Marks, Seeman, & Haller, 1974), we could obtain a description of the personological aspects associated with such a profile. Of course, a well-trained clinician would have internalized such profile configurations and would have little need to consult such sources.

Content analysis. In the development of the clinical MMPI scales, the primary focus was on empirical validity – that is, did a particular item show a statistically significant differential response rate between a particular psychiatric group and the normal group. Most of the resulting scales were quite heterogeneous in content, but relatively little attention was paid to such content. The focus was more on the resulting profile. Quite clearly, however, two clients could obtain the same raw score on a scale by endorsing different combinations of items, so a number of investigators suggested systematic analyses of the item content. Harris and Lingoes (cited by W. G. Dahlstrom, Welsh, & L. E. Dahlstrom, 1972) examined the item content of six of the clinical scales they felt were heterogeneous in composition, and logically grouped together the items that seemed similar. These groupings in turn became subscales that could be scored, and in fact 28 such subscales can be routinely computer scored on the MMPI-2. For example, the items on scale 2 (Depression) fall into five clusters labeled subjective depression, psychomotor retardation, physical malfunctioning, mental dullness, and brooding. Note that these subgroupings are based on clinical judgment and not factor analysis.

A different approach was used by Butcher, Graham, Williams, et al. (1990) to develop content scales for the MMPI-2. Rather than start with the clinical scales, they started with the item pool and logically defined 22 categories of content. Three clinical psychologists then assigned each item to one of the categories. Items for which there was agreement as to placement then represented provisional scales. Protocols from two samples of psychiatric patients and two samples of college students were then subjected to an internal consistency analysis, and the response to each item in a provisional scale was correlated with the total score on that scale. A number of other statistical and logical refinements were undertaken (see J. R. Graham, 1993), with the end result a set of 15 content scales judged to be internally consistent, relatively independent of each other, and reflective of the content of most of the MMPI-2 items. These scales have such labels as anxiety, depression, health concerns, low self-esteem, and family problems.
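A minimal sketch (in Python, with invented protocols) of the item-total correlation step just described: each item's responses are correlated with the total score on its provisional scale, and items falling below some threshold would be dropped.

from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

# Invented protocols: rows are respondents, columns are four true-false
# items (1 = keyed response) assigned to one provisional scale.
protocols = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]

# For simplicity the item itself is included in the total; a "corrected"
# item-total correlation would remove it first.
totals = [sum(row) for row in protocols]
for i in range(4):
    item = [row[i] for row in protocols]
    print(f"item {i + 1}: r = {pearson_r(item, totals):.2f}")
# item 4 correlates near zero with the total and would be dropped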
Critical items. A number of investigators have identified subsets of the MMPI item pool as being particularly critical in content, reflective of severe psychopathology or related aspects, where endorsement of the keyed response might serve to alert the clinician. Lachar and Wrobel (1979), for example, asked 14 clinical psychologists to identify critical items that might fall under one of 14 categories, such as deviant beliefs and problematic anger. After some additional statistical analyses, 111 such items, listed under 5 major headings, were identified.

Factor analysis. Factor analysis of the MMPI typically yields two basic dimensions, one of anxiety or general maladjustment and the other of repression or neuroticism (Eichman, 1962; Welsh, 1956). In fact, Welsh (1956) developed two scales on the MMPI to assess the anxiety and repression dimensions, by selecting items that were most highly loaded on their respective factors and further selecting those with the highest internal-consistency values.

Note that there are at least two ways of factor analyzing an inventory such as the MMPI. After the MMPI is administered to a large sample of subjects, we can score each protocol and factor analyze the scale scores, or we can factor analyze the responses to the items. The Eichman (1962) study took the first approach. Johnson, Null, Butcher, et al. (1984) took the second approach and found some 21 factors, including neuroticism, psychoticism, sexual adjustment, and denial of somatic problems. Because the original clinical scales are heterogeneous, a factor analysis, which by its nature tends to produce more homogeneous groupings, would of course result in more dimensions. An obvious next step would be to use such factor-analytic results to construct scales that would be homogeneous. In fact this has been done (Barker, Fowler, & Peterson, 1971; K. B. Stein, 1968), and the resulting scales seem to be as reliable and as valid as the standard scales. They have not, however, "caught on."
Other scales. Over the years, literally hundreds of additional scales on the MMPI were developed. Because the MMPI item pool is an open system, it is not extremely difficult to identify subjects who differ on some nontest criterion, administer the MMPI, and statistically analyze the items as to which discriminate the contrasting groups or correlate significantly with the nontest criterion. Many of these scales did not survive cross-validation, were too limited in scope, or were found to have some psychometric problems, but a number have proven quite useful and have been used extensively.

One such scale is the Ego Strength Scale (Es), developed by Barron (1953) to predict success in psychotherapy. The scale was developed by administering the MMPI to a group of neurotic patients and again after 6 months of psychotherapy, comparing the responses of those judged to have clearly improved vs. those judged as unimproved. The original scale had 68 items; the MMPI-2 version has 52. Despite the fact that the initial samples were quite small (n = 17 and 16, respectively), that reported internal-consistency values were often low (in the .60s), and that the literature on the Es Scale is very inconsistent in its findings (e.g., Getter & Sundland, 1962; Tamkin & Klett, 1957), the scale continues to be popular.

Another relatively well-known extra MMPI scale is the 49-item MacAndrew Alcoholism Scale (MAC; MacAndrew, 1965), developed to differentiate alcoholic from nonalcoholic psychiatric patients. The scale was developed by using a contrasted-groups approach – an analysis of the MMPI responses of 200 male alcoholics seeking treatment at a clinic vs. the MMPI responses of 200 male nonalcoholic psychiatric patients. The MAC Scale has low internal consistency (alphas of .56 for males and .45 for females) but adequate test-retest reliability over 1-week and 6-week intervals, with most values in the mid .70s and low .80s (J. R. Graham, 1993). Gottesman and Prescott (1989) questioned the routine use of this scale; they pointed out that when the base rate for alcohol abuse is different from that of the original study, the accuracy of the MAC is severely affected.

Other scales continue to be developed. For example, a new set of scales dubbed the "Psychopathology Five" (aggressiveness, psychoticism, constraint, negative emotionality, and positive emotionality) were recently developed (Harkness, McNulty, & Ben-Porath, 1995). Similarly, many short forms of the MMPI have been developed. Streiner and Miller (1986) counted at least seven such short forms and suggested that our efforts would be better spent in developing new tests.

Reliability. Reliability of the "validity" scales and of the clinical scales seems adequate. The test manual for the MMPI-2 gives test-retest (1-week interval) results for a sample of males and a sample of females. The coefficients range from a low of .58 for scale 6 (Pa) for females to a high of .92 for scale 0 (Si) for males. Of the 26 coefficients given (3 validity scales plus 10 clinical scales, for males and for females), 8 are in the .70s and 12 in the .80s, with a median coefficient of about .80.

Since much of the interpretation of the MMPI depends upon profile analysis, we need to ask about the reliability of configural patterns, because they may not necessarily be the same as the reliability of the individual scales. Such data are not yet available for the MMPI-2, but some are available for the MMPI. J. R. Graham (1993) summarizes a number of studies in this area that used different test-retest intervals and different kinds of samples. In general, the results suggest that about one half of the subjects have the same profile configuration on the two administrations when such a configuration is defined by the highest scale (a high-point code), and this goes down to about one fourth when the configuration is defined by the three highest scales (a 3-point code). Thus the stability over time of such configurations is not that great, although the evidence suggests that changes in profiles in fact reflect changes in behavior.
The MMPI-2 test manual also gives alpha coefficients for the two normative samples. The 26 coefficients range from a low of .34 to a high of .87, with a median of about .62. Ten of the alpha coefficients are above .70 and 16 are below. The MMPI-2 scales are heterogeneous, and so these low values are not surprising. Scales 1, 7, 8, and 0 seem to be the most internally consistent, while scales 5, 6, and 9 are the least internally consistent.
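For dichotomously scored (true-false) items, coefficient alpha is equivalent to the Kuder-Richardson formula 20 (KR-20), which can be computed directly from a set of protocols; here is a minimal sketch (in Python, with invented data).

from statistics import mean, pvariance

def kr20(protocols):
    """KR-20 internal consistency for dichotomously scored items.
    protocols: list of respondent rows, each a list of 0/1 item scores."""
    k = len(protocols[0])                       # number of items
    totals = [sum(row) for row in protocols]
    item_vars = 0.0
    for i in range(k):
        p = mean(row[i] for row in protocols)   # proportion passing item i
        item_vars += p * (1 - p)
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Invented protocols: 6 respondents x 5 items
protocols = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
print(round(kr20(protocols), 2))  # 0.83 for this highly consistent set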
Validity. The issue of the validity of the MMPI-2 is a very complex one, not only because we are dealing with an entire set of scales rather than just one, but also because there are issues about the validity of configural patterns, of interpretations derived from the entire MMPI profile, of differential results with varying samples, and of the interplay of such aspects as base rates, gender, educational levels of the subjects, characteristics of the clinicians, and so on.

J. R. Graham (1993) indicates that validity studies of the MMPI fall into three general categories. The first are studies that have compared the MMPI profiles of relevant criterion groups. Most of these studies have found significant differences on one or more of the MMPI scales among groups that differed on diagnostic status or other criteria. A second category of studies tries to identify reliable nontest behavioral correlates of MMPI scales or configurations. The results of these studies suggest that there are such reliable correlates, but their generalizability is sometimes in question; i.e., the findings may be applicable to one type of sample, such as alcoholics, but not to another, such as adolescents who are suicidal. A third category of studies looks at the MMPI results and at the clinician who interprets those results as one unit, and focuses then on the accuracy of the interpretations. Here the studies are not as supportive of the validity of MMPI-based inferences, but the area is a problematic and convoluted one (see Garb, 1984; L. R. Goldberg, 1968).

Racial differences. There is a substantial body of literature on the topic of racial differences on the MMPI, but the results are by no means unanimous, and there is considerable disagreement as to the implications of the findings.

A number of studies have found differences between black and white subjects on some MMPI scales, with blacks tending to score higher than whites on scales F, 8, and 9, but the differences are small and, although statistically significant, may not be of clinical significance (W. G. Dahlstrom, Lachar, & L. E. Dahlstrom, 1986; Pritchard & Rosenblatt, 1980). Similar differences have been reported on the MMPI-2, but although statistically significant, they are less than 5 T-score points and therefore not really clinically meaningful (Timbrook & Graham, 1994).

R. L. Greene (1987) reviewed 10 studies that compared Hispanic and white subjects on the MMPI. The differences seem to be even smaller than those between blacks and whites, and R. L. Greene (1987) concluded that there was no pattern to the obtained differences. R. L. Greene (1987) also reviewed seven studies that compared American Indians and whites, and three studies that compared Asian-American and white subjects. Here also there were few differences and no discernible pattern. Hall, Bansal, and Lopez (1999) did a meta-analytical review and concluded that the MMPI and MMPI-2 do not unfairly portray African-Americans and Latinos as pathological. The issue is by no means a closed one, and the best that can be said for now is that great caution is needed when interpreting MMPI profiles of nonwhites.

MMPI manuals. There is a veritable flood of materials available to the clinician who wishes to use the MMPI. Not only is there a vast professional body of literature on the MMPI, with probably more than 10,000 such articles, but there are also review articles, test manuals, books, handbooks, collections of group profiles, case studies, and other materials (e.g., Butcher, 1990; Drake & Oetting, 1959; J. R. Graham, 1993; R. L. Greene, 1991; P. A. Marks, Seeman, & Haller, 1974).

Diagnostic failure. The MMPI does not fulfill its original aim, that of diagnostic assessment, and perhaps it is well that it does not. Labeling someone as schizophrenic has limited utility, although one can argue that such psychiatric nosology is needed both as a shorthand and as an administrative tool. It is more important to understand the psychodynamic functioning of a client and the client's competencies and difficulties. In part, the diagnostic failure of the MMPI may be due to the manner in which the clinical scales were constructed. Each scale is basically composed of items that empirically distinguish normals from psychiatric patients. But the clinical challenge is often not diagnosis but differential diagnosis – usually it doesn't take that much clinical skill to determine that a person is psychiatrically impaired, but often it can be difficult to diagnose the specific nature of such impairment. Thus the MMPI clinical scales might have worked better diagnostically had they been developed to discriminate specific diagnostic groups from each other.
Usefulness of the MMPI. There are at least two ways to judge the usefulness of a test. The first is highly subjective and consists of the judgment made by the user; clinicians who use the MMPI see it as a very valuable tool for diagnostic purposes, for assessing a client's strengths and problematic areas, and for generating hypotheses about etiology and prognosis. The second method is objective and requires an assessment of the utility of the test by, for example, assessing the hits and errors of profile interpretation. Note that in both ways, the test and the test user are integrally related. An example of the second way is the study by Coons and Fine (1990), who rated "blindly" a series of 63 MMPIs as to whether they represented patients with multiple personality or not. In this context, rating blindly meant that the authors had no information other than the MMPI profile. (Incidentally, when a clinician uses a test, it is recommended that the results be interpreted with as much background information about the client as possible.) The 63 MMPI profiles came from 25 patients with the diagnosis of multiple personality and 38 patients with other diagnoses, some easily confused or coexistent with multiple personality. The overall hit rate for the entire sample was 71.4%, with a 68% (17/25) hit rate for the patients with multiple personality. The false negative rates for the two investigators were similar (28.5% and 36.5%), but the false positive rates were different (44.4% and 22.2%), a finding that the authors were at a loss to explain. Such results are part of the information needed to evaluate the usefulness of an instrument, but unfortunately the matter is not that easy.

In this study, for example, there are two nearly fatal flaws. The first is that the authors do not take into account the role of chance. Because the diagnostic decision is a binary one (multiple personality or not), we have a situation similar to a true-false test, where the probability of getting each item correct by chance is 50-50. The second and more serious problem is that of the 25 patients diagnosed with multiple personality, 24 were female and only 1 male; of the 38 other patients, 31 were female and 7 were male. Diagnosis and gender are therefore confounded.
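The role of chance can be made concrete. Given the base rates in this sample, a rater who simply guessed would already achieve a substantial hit rate, as this minimal sketch (in Python, using the sample sizes from the Coons and Fine study) shows; the guessing strategies themselves are hypothetical baselines for comparison with the reported 71.4%.

n_mp, n_other = 25, 38          # multiple-personality and other patients
base_rate = n_mp / (n_mp + n_other)

def expected_hit_rate(g):
    """Expected accuracy if the rater calls 'multiple personality'
    on a randomly chosen proportion g of the profiles."""
    return g * base_rate + (1 - g) * (1 - base_rate)

print(round(expected_hit_rate(0.5), 3))        # 0.5: random 50-50 guessing
print(round(expected_hit_rate(base_rate), 3))  # 0.521: guessing at the base rate
print(round(expected_hit_rate(0.0), 3))        # 0.603: always saying "not MP"
# The reported 71.4% exceeds these baselines, but by less than it
# might first appear.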
Criticisms. Despite the popularity and usefulness of the MMPI, it has been severely criticized for a number of reasons. Initially, many of the clinical samples used in the construction of the clinical scales were quite small, and the criterion used, namely psychiatric diagnosis, was relatively unreliable. The standardization sample, the 724 hospital visitors, was large, but they were all white, primarily from small Minnesota towns or rural areas, and from skilled and semiskilled socioeconomic levels. The statistical and psychometric procedures utilized were, by today's standards, rather primitive and unsophisticated.

The resulting scales were not only heterogeneous (not necessarily a criticism unless one takes a factor-analytic position), but there is considerable item overlap; i.e., the same item may be scored on several scales, thus contributing to the intercorrelations among scales. In fact, several of the MMPI scales do intercorrelate. The test manual for the MMPI-2 (Hathaway et al., 1989), for example, reports such correlations as +.51 between scales 0 (Si) and 2 (D), and .56 between scales 8 (Sc) and 1 (Hs).

Another set of criticisms centered on response styles. When a subject replies true or false to a particular item, the hope is that the content of the item elicits the particular response. There are people, however, who tend to be more acquiescent and so may agree not so much because of the item content but because of the response options; they tend to agree regardless of the item content (the same can be said of "naysayers," those who tend to disagree no matter what). A related criticism is that the response is related to the social desirability of the item (see Chapter 16). There is, in fact, an imbalance in the proportion of MMPI items keyed true or false, and studies of the social-desirability dimension seem to suggest a severe confounding.

Substantial criticism continues to be leveled at the MMPI-2, in large part because of its continuity with the MMPI. Helmes and Reddon (1993), for example, cite the lack of a theoretical model, heterogeneous scale content, and suspect diagnostic criteria as major theoretical concerns. In addition, they are concerned about scale overlap, lack of cross-validation, the role of response style, and problems with the norms; similar criticisms were made by Duckworth (1991).

THE MILLON CLINICAL MULTIAXIAL INVENTORY (MCMI)

The MCMI was designed as a better and more modern version of the MMPI. In the test manual, Millon (1987) points out some 11 distinguishing features of the MCMI; 6 are of particular saliency here:
3. The items that survived both stages were then
1. The MCMI is brief and contains only 175
assessed with external criteria; Millon (1987)
items, as opposed to the more lengthy MMPI.
called this external-criterion validity, or more
2. The measured variables reflect a comprehen- simply criterion validity.
sive clinical theory, as well as specific theoretical
notions about personality and psychopathology, Note that the above represent a variety of val-
as opposed to the empiricism that underlies the idation procedures, often used singly in the vali-
MMPI. dation of a test. Now, let’s look at these three steps
3. The scales are directly related to the DSM-III a bit more specifically.
classification, unlike the MMPI whose diagnos- The MCMI was developed by first creating a
tic categories are tied to an older and somewhat pool of some 3,500 self-descriptive items, based
outdated system. on theoretically derived definitions of the various
4. The MCMI scales were developed by compar- syndromes. These items were classified, appar-
ing specific diagnostic groups with psychiatric ently on the basis of clinical judgment, into 20
patients, rather than with a normal sample as in clinical scales; 3 scales were later replaced. All
the MMPI. the items were phrased with “true” as the keyed
5. Actuarial base-rate data were used to quantify response, although Millon felt that the role of
scales, rather than the normalized standard-score acquiescence (answering true) would be mini-
transformation used in the MMPI. mal. The item pool was then reduced on the basis
of rational criteria: Items were retained that were
6. Three different methods of validation were
clearly written, simple, relevant to the scale they
used: (1) theoretical-substantive, (2) internal-
belonged to, and reflective of content validity.
structural, and (3) external-criterion, rather than
Items were also judged by patients as to clarity
just one approach as in the MMPI.
and by mental health professionals as to relevance
to the theoretical categories. These steps resulted
Aim of the MCMI. The primary aim of the in two provisional forms of 566 items each (inter-
MCMI is to provide information to the clini- estingly, the number of items was dictated by the
cian about the client. The MCMI is also pre- size of the available answer sheet!).
sented as a screening device to identify clients In the second step, the forms were admin-
who may require more intensive evaluation, and istered to a sample of clinical patients, chosen
as an instrument to be used for research purposes. to represent both genders, various ethnic back-
The test is not a general personality inventory grounds, and a representative age range. Some
and should be used only for clinical subjects. The patients filled out one form and some patients
manual explicitly indicates that the computer- filled out both. Item-scale homogeneity was then
generated narrative report is considered a “pro- assessed through computation of internal consis-
fessional to professional” consultation, and that tency. The intent here was not to create “pure”
direct sharing of the report’s explicit content with and “independent” scales as a factor-analytic
either the patient or relatives of the patient is approach might yield, as the very theory dic-
strongly discouraged. tates that some of the scales correlate substantially
The intent here was not to create "pure" and "independent" scales, as a factor-analytic approach might yield, since the very theory dictates that some of the scales correlate substantially with each other. Rather, the intent was to identify items that statistically correlated at least .30 or above with the total score on the scale they belonged to, as defined by the initial clinical judgment and theoretical stance. In fact, the median correlation of items that were retained was about .58. Items that showed extreme endorsement frequencies, less than 15% or greater than 85%, were eliminated (you recall from Chapter 2 that such items are not very useful from a psychometric point of view). These and additional screening steps resulted in a 289-item research form that included both true and false keyed responses, and items that were scored on multiple scales.
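These two screening rules are easy to make concrete; here is a minimal sketch (in Python, with invented responses) that drops items with extreme endorsement frequencies and keeps items whose item-total correlation reaches the .30 threshold described above.

from statistics import mean, pstdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

def keep_item(item_responses, scale_totals, lo=0.15, hi=0.85, min_r=0.30):
    """Apply the endorsement-frequency and item-total screening rules."""
    endorsement = mean(item_responses)   # proportion answering "true"
    if not lo <= endorsement <= hi:
        return False                     # too extreme to be informative
    return pearson_r(item_responses, scale_totals) >= min_r

# Invented data: responses of 8 patients to one item (1 = "true"),
# and their total scores on the scale the item is assigned to.
item = [1, 1, 0, 1, 0, 0, 1, 0]
totals = [9, 8, 3, 7, 4, 2, 9, 5]
print(keep_item(item, totals))  # True: endorsement .50, item-total r well above .30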
In the third stage, two major studies were carried out. In the first study, 200 experienced clinicians administered the experimental form of the MCMI to as many of their patients as feasible (a total of 682 patients), and rated each patient, without recourse to the MCMI responses, on a series of comprehensive and standard clinical descriptions that paralleled the 20 MCMI dimensions. An item analysis was then undertaken to determine if each item correlated the highest with its corresponding diagnostic category. This resulted in 150 items being retained, apparently each item having an average scale overlap of about 4 – that is, each item is scored on, or belongs to, about 4 scales on average, although on some scales the keyed response is "true" and on some scales the keyed response for the same item is "false." Note, however, that this overlap of items, which occurs on such tests as the MMPI, is here not a function of mere correlation but is dictated by theoretical expectations.

The results of this first study indicated that three scales were not particularly useful, and so the three scales (hypochondriasis, obsession-compulsion, and sociopathy) were replaced by three new scales (hypomanic, alcohol abuse, and drug abuse). This meant that a new set of items was developed, added to the already available MCMI items, and most of the steps outlined above were repeated. This finally yielded 175 items, with 20 scales ranging in length from 16 items (Psychotic Delusion) to 47 items (Hypomanic).

Parallel with DSM. One advantage of the MCMI is that its scales and nosology are closely allied with the most current DSM classification. This is no accident, because Millon has played a substantial role in some of the work that resulted in the DSM.

Millon's theory. Millon's theory about disorders of the personality is deceptively simple and based on two dimensions. The first dimension involves positive or negative reinforcement – that is, gaining satisfaction vs. avoiding psychological discomfort. Patients who experience few satisfactions in life are detached types; those who evaluate satisfaction in terms of the reaction of others are dependent types. Where the satisfaction is evaluated primarily by one's own values with disregard for others, we have an independent type, and those who experience conflict between their values and those of others are ambivalent personalities. The second dimension has to do with coping, with maximizing satisfaction and minimizing discomfort. Some individuals are active, and manipulate or arrange events to achieve their goals; others are passive, and "cope" by being apathetic, resigned, or simply passive. The four patterns of reinforcement and the two patterns of coping result in eight basic personality styles: active detached, passive detached, active independent, and so on. These eight styles are of course assessed by each of the eight basic personality scales of the MCMI. Table 7.3 illustrates the parallel.

Millon believes that such patterns or styles are deeply ingrained and that a patient is often unaware of the presence of such patterns and their maladaptiveness. If the maladjustment continues, the basic maladaptive personality pattern becomes more extreme, as reflected by the three personality disorder scales S, C, and P. Distortions of the basic personality patterns can also result in clinical-syndrome disorders, but these are by their very nature transient and depend upon the amount of stress present. Scales 12 to 20 assess these disorders, with scales 12 through 17 assessing those with moderate severity, and scales 18 through 20 assessing the more severe disorders. Although there is also a parallel between the eight basic personality types and the clinical-syndrome disorders, the correspondence is more complex and is not one to one. For example, neurotic depression, or what Millon (1987) calls dysthymia (scale 15), occurs more commonly among avoidant, dependent, and passive-aggressive personalities. Note that such a theory focuses on psychopathology; it is not a theory of normality.
Table 7–3. Personality Patterns and Parallel MCMI Scales

Type of personality     MCMI scale           Can become:
Passive detached        Schizoid             Schizotypal
Active detached         Avoidant             Schizotypal
Passive dependent       Dependent            Borderline
Active dependent        Histrionic           Borderline
Passive independent     Narcissistic         Paranoid
Active independent      Antisocial           Paranoid
Passive ambivalent      Compulsive           Borderline &/or Paranoid
Active ambivalent       Passive aggressive   Borderline &/or Paranoid
Description. There are 22 clinical scales in the 1987 version of the MCMI, organized into three broad categories to reflect distinctions between persistent personality features, current symptom states, and level of pathologic severity. These three categories parallel the three axes of the DSM-III; hence the "multiaxial" name. One of the distinctions made in DSM-III is between more enduring personality characteristics of the patient (called Axis II) and more acute clinical disorders they manifest (Axis I). In many ways this distinction parallels the chronic vs. acute, morbid vs. premorbid terminology. The MCMI is one of the few instruments that is fully consonant with this distinction. There are also four validity scales. Table 7.4 lists the scales with some defining descriptions.

The MCMI, like the MMPI and the CPI, is an open system, and Millon (1987) suggests that investigators may wish to use the MCMI to construct new scales by, for example, item analyses of responses given by a specific diagnostic group vs. responses given by an appropriate control or comparison group. New scales can also be constructed by comparing contrasted groups – for example, patients who respond favorably to a type of psychotherapy vs. those who don't. The MCMI was revised in 1987 (see Millon & Green, 1989, for a very readable introduction to the MCMI-II). Two scales were added, and responses to items were assigned weights of 3, 2, or 1 to optimize diagnostic accuracy and diminish interscale correlations.
Administration. The MCMI consists of 175 true-false statements and requires at least an eighth-grade reading level. The MCMI is usually administered individually, but could be used in a group setting. As the manual indicates, the briefness of the test and its easy administration by an office nurse, secretary, or other personnel make it a convenient instrument. The instructions are clear and largely self-explanatory.

Scoring. Hand scoring templates are not available, so the user is required to use the commercially available scoring services. Although this may seem to be driven by economic motives, and probably is, the manual argues that hand scoring so many scales leads to errors of scoring and, even more important, that as additional research data are obtained, refinements in scoring and in normative equivalence can be easily introduced in the computer scoring procedure, but not so easily in outdated templates. The manual does include a description of the item composition of each scale, so a template for each scale could be constructed. Computer scoring services are available from the test publisher, including a computer-generated narrative report.

Coding system. As with the MMPI, there is a profile coding system that uses a shorthand notation to classify a particular profile, by listing the basic personality scales (1–8), the pathological personality disorder scales (S, C, P), the moderate clinical syndrome scales (A, H, N, D, B, T), and the severe clinical syndrome scales (SS, CC, PP), in order of elevation within each of these four sections.

Decision theory. We discussed in Chapter 3 the notions of hits and errors, including false positives and false negatives. The MCMI incorporates this into the scale guidelines, and its manual explicitly gives such information. For example, for scale 1, the schizoid scale, the base rate in the patient sample was .11 (i.e., 11% of the patients were judged to exhibit schizoid symptoms). Eighty-eight percent of patients who were diagnosed as schizoid, in fact, scored on the MCMI above the cutoff line on that scale. Five percent of those scoring above the cutoff line were incorrectly classified; that is, their diagnosis was not schizoid and, therefore, they would be classified as false positives. The overall hit rate for this scale is 94%.
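Figures like these can be combined into the quantities a clinician actually needs. Here is a minimal sketch (in Python) using the base rate and sensitivity just cited; the specificity is an assumed, illustrative value (chosen because it happens to reproduce an overall hit rate near 94%), since the manual reports its error figure in a different form.

base_rate = 0.11      # proportion of patients judged schizoid
sensitivity = 0.88    # P(above cutoff | schizoid)
specificity = 0.95    # assumed for illustration: P(below cutoff | not schizoid)

tp = base_rate * sensitivity                # true positives
fn = base_rate * (1 - sensitivity)          # false negatives (misses)
tn = (1 - base_rate) * specificity          # true negatives
fp = (1 - base_rate) * (1 - specificity)    # false positives

hit_rate = tp + tn                          # overall proportion correct
ppv = tp / (tp + fp)                        # P(schizoid | above cutoff)
print(round(hit_rate, 3), round(ppv, 3))    # 0.942 0.685

Note what the sketch illustrates: with a low base rate, a scale can post an impressive overall hit rate even though a substantial share of those above the cutoff do not carry the diagnosis.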
Table 7–4. Scales on the MCMI


Scale (number of items) High scorers characterized by:
A. Basic Personality Patterns. These reflect everyday ways of functioning that characterize patients.
They are relatively enduring and pervasive traits.
1. Schizoid (asocial) (35) Emotional blandness, impoverished thought processes
2. Avoidant (40) Undercurrent of sadness and tension; socially isolated;
feelings of emptiness
3. Dependent (Submissive) (37) Submissive; avoids social tension
4. Histrionic (Gregarious) (40) Dramatic but superficial affect; immature and childish
5. Narcissistic (49) Inflated self-image
6a. Antisocial (45) Verbally and physically hostile
6b. Aggressive (45) Aggressive
7. Compulsive (conforming) (38) Tense and overcontrolled; conforming and rigid
8a. Passive-Aggressive (Negativistic) (41) Moody and irritable; discontented and ambivalent
8b. Self-defeating personality (40) Self-sacrificing; masochistic
B. Pathological Personality Disorders. These scales describe patients with chronic severe pathology.
9. Schizotypal (Schizoid) (44) Social detachment and behavioral eccentricity
10. Borderline (Cycloid) (62) Extreme cyclical mood ranging from depression to
excitement
11. Paranoid (44) Extreme suspicion and mistrust
C. Clinical symptom syndromes. These nine scales represent symptom disorders, usually of briefer
duration than the personality disorders, and often are precipitated by external events.
12. Anxiety (25) Apprehensive; tense; complains of many physical
discomforts
13. Somatoform (31) Expresses psychological difficulties through physical
channels (often nonspecific pains and feelings of ill
health)
14. Bipolar-manic (37) Elevated but unstable moods; overactive, distractable,
and restless
15. Dysthymia (36) Great feelings of discouragement, apathy, and futility
16. Alcohol dependence (46) Alcoholic
17. Drug dependence (58) Drug abuse
18. Thought disorder (33) Schizophrenic; confused and disorganized
19. Major depression (24) Severely depressed; expresses dread of the future
20. Delusional disorder (23) Paranoid, belligerent, and irrational
D. Validity scales
21. Weight factor (or disclosure level). This is not really a scale as such, but is a score adjustment
applied under specific circumstances. It is designed to moderate the effects of either excessive
defensiveness or excessive emotional complaining (i.e., fake good and fake bad response sets).
22. Validity index. Designed to identify patients who did not cooperate or did not answer relevantly
because they were too disturbed. The scale is composed of 4 items that are endorsed by fewer
than 1 out of 100 clinical patients. Despite its brevity, the scale seems to work as intended.
23. Desirability gauge. The degree to which the respondent places him/herself in a favorable light
(i.e., fake good).
24. The Debasement measure. The degree to which the person depreciates or devalues him/herself
(i.e., fake bad).
Base-rate scores. The MCMI uses a rather unique scoring procedure. On most tests, the raw score on a scale is changed into a T score or some other type of standard score. This procedure assumes that the underlying dimension is normally distributed. Millon (1987) argues that this is not the case when a set of scales is designed to represent personality types or clinical syndromes because they are not normally
distributed in patient populations. The aim of scales such as those on the MCMI is to identify the degree to which a patient is or is not a member of a diagnostic entity. And so Millon conducted two studies of more than 970 patients in which clinicians were asked to diagnose these patients along the lines of the MCMI scales. These studies provided the basic base-rate data. Millon was able to determine what percentage of the patients were judged to display specific diagnostic features, regardless of their actual diagnosis, and to determine the relative frequency of each diagnostic entity. For example, 27% of the patient sample was judged to exhibit some histrionic personality features, but only 15% were assigned this as their major diagnosis. Based on these percentages then, base-rate scores were established for each of the clinical scales, including an analysis of false positives. Despite the statistical and logical sophistication of this method, the final step of establishing base-rate scores was very much a clinical-intuitive one, where a base-rate score of 85 was arbitrarily assigned as the cutoff line that separated those with a specific diagnosis and those without that diagnosis, a base-rate score of 60 was arbitrarily selected as the median, and a base rate of 35 was arbitrarily selected as the "normal" median. If the above discussion seems somewhat vague, it is because the test manual is rather vague and does not yield the specific details needed.
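Because the manual does not give the details, any illustration must be hedged; still, the general logic of anchoring raw scores to base-rate (BR) values can be sketched in a few lines of Python. The raw-score anchors below are entirely hypothetical, as is the assumed upper end of the BR metric; only the 35/60/85 anchor values come from the discussion above.

    import numpy as np

    def base_rate_score(raw, raw_anchors, br_anchors=(0, 35, 60, 85, 115)):
        """Map a raw scale score onto the BR metric by piecewise-linear
        interpolation between clinically chosen anchor points."""
        return float(np.interp(raw, raw_anchors, br_anchors))

    # Hypothetical anchors for a 35-item scale: raw median of 8 for normals,
    # 15 for patients, and a diagnostic cutoff at a raw score of 21.
    print(base_rate_score(18, (0, 8, 15, 21, 35)))   # -> 72.5

The point of such a transformation is that a BR score carries diagnostic meaning directly: a patient at BR 85 sits exactly at the judged diagnostic cutoff, regardless of how many items the particular scale happens to have.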
The idea of using base rates as a basis of scoring is not a new idea; in fact, one of the authors of the MMPI has argued for, but not implemented, such an approach (Meehl & Rosen, 1955). One problem with such base rates is that they are a function of the original sample studied. A clinician working with clients in a drug-abuse residential setting would experience rather different base rates in that population than a clinician working in an outpatient setting associated with a community mental-health clinic, yet both would receive test results on their clients reflective of the same base rate as found in a large research sample.

Reliability. The manual presents test-retest reliability for two samples: 59 patients tested twice with an average interval of 1 week, and 86 patients tested twice with an average interval of about 5 weeks. For the first sample, the correlation coefficients range from .78 to .91, with most of the values in the .80 to .85 range. For the second sample, the coefficients range from .61 to .85, with a median of about .77. Because all the patients were involved in psychotherapy programs, we would expect the 5-week reliabilities to be lower. We would also expect, and the results support this, the personality pattern scales to be highest in reliability, followed by the pathological personality scales, with the clinical syndrome scales least reliable (because they are the most changeable and transient).

Internal consistency reliability (KR 20) was also assessed in two samples totaling almost 1,000 patients. These coefficients range from .58 to .95, with a median of .88; only one scale, the 16-item PP scale, which is the shortest scale, has a KR reliability of less than .70.
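The KR 20 formula itself is not reproduced in the manual discussion above, but for readers who want to see the computation, here is a minimal sketch of coefficient alpha; applied to dichotomously scored (true-false) items such as the MCMI's, alpha reduces to KR 20.

    import numpy as np

    def coefficient_alpha(items):
        """items: a respondents x items array of item scores.
        For 0/1 (true-false) items this is identical to KR 20."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1).sum()
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances / total_variance)

Note how the formula rewards long scales with homogeneous items; this is why the shortest scale is also the one falling below .70.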
Validity. A number of authors, such as Loevinger (1957) and Jackson (1970), have argued that validation should not simply occur at the end of a test's development, but should be incorporated in all phases of test construction. That seems to be clearly the case with the MCMI; as we have seen above, its development incorporated three distinct validational stages. We can also ask, in a more traditional manner, about the validity of the resulting scales. The MCMI manual presents correlations of the MCMI scales with scales from other multivariate instruments, namely the MMPI, the Psychological Screening Inventory, and the SCL-90. It is not easy to summarize such a large matrix of correlations, but in general the pattern of correlations for each MCMI scale supports its validity, and the specific significant correlations seem to be in line with both theoretical expectations and empirically observed clinical syndromes. (For a comparison of the MCMI-II and the MMPI, see McCann, 1991.)

Norms. Norms on the MCMI are based on a sample of 297 normal subjects ranging in age from 18 to 62, and 1,591 clinical patients ranging in age from 18 to 66. These patients came from more than 100 hospitals and outpatient centers, as well as from private psychotherapists in the United States and Great Britain. These samples are basically samples of convenience, chosen for their availability, but they are also reflective of diversity in age, gender, educational level, and socioeconomic status.
By 1981, MCMI protocols were available on more than 43,000 patients, and these data were used to refine the scoring/normative procedure. For the MCMI-II, a sample of 519 clinicians administered the MCMI and the MCMI-II to a total of 825 patients diagnosed using the DSM-III-R criteria. Another 93 clinicians administered the MCMI-II to 467 diagnosed patients.

Scale intercorrelations. As we have seen, scales on the MCMI do correlate with each other because there is item overlap, and because the theoretical rationale for the 20 dimensions dictates such intercorrelations. Empirically, what is the magnitude of such relationships? The Manual presents data on a sample of 978 patients. Correlating 20 scales with each other yields some 190 coefficients (20 × 19 divided by 2 to eliminate repetition). These coefficients range from a high of .96 (between scales A and C) to a low of −.01 (between scales B and PP), with many of the scales exhibiting substantial correlations.

A factor analyst looking at these results would throw his or her hands up in despair, but Millon argues for the existence of such correlated but separate scales on the basis of their clinical utility. As I tell my classes, if we were to measure shirt sleeves we would conclude, from a factor-analytic point of view, that only one such measurement is needed, but most likely we would still continue to manufacture shirts with two sleeves rather than just one.
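The arithmetic of the 190 coefficients, and how such a matrix would be computed from raw data, can be sketched in a few lines of Python; the score matrix below is random placeholder data, not MCMI data.

    import numpy as np

    k = 20                        # number of MCMI scales
    print(k * (k - 1) // 2)       # 190 unique pairwise coefficients

    scores = np.random.rand(978, k)           # placeholder patients x scales matrix
    r = np.corrcoef(scores, rowvar=False)     # 20 x 20 correlation matrix
    unique_rs = r[np.triu_indices(k, k=1)]    # the 190 off-diagonal coefficients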
Factor analysis. The MCMI manual reports the results of two factor analyses, one done on a general psychiatric sample (N = 744), and one on a substance-abuse sample (N = 206). For the general psychiatric sample, the factor analysis suggested four factors, with the first three accounting for 85% of the variance. These factors are rather complex. For example, 13 of the 20 scales load significantly on the first factor, which is described as "depressive and labile emotionality expressed in affective moodiness and neurotic complaints" (Millon, 1987). In fact, the first three factors parallel a classical distinction found in the abnormal psychology literature of affective disorders, paranoid disorders, and schizophrenic disorders.

The factor analysis of the substance-abuse patients also yielded a four-factor solution, but the pattern here is somewhat different. The first factor seems to be more of a general psychopathology factor, the second factor a social acting-out and aggressive dimension related to drug abuse, and a third dimension (factor 4) reflects alcohol abuse and compulsive behavior. These results can be viewed from two different perspectives: Those who seek factorial invariance would perceive such differing results in a negative light, as reflective of instability in the test. Those who seek "clinical" meaning would see such results in a positive light, as they correspond to what would be predicted on the basis of clinical theory and experience.

Family of inventories. The MCMI is one of a family of inventories developed by Millon. These include the Millon Behavioral Health Inventory (Millon, Green, & Meagher, 1982b), which is for use with medical populations such as cancer patients or rehabilitation clients, and the Millon Adolescent Personality Inventory (Millon, Green, & Meagher, 1982a) for use with junior and senior high-school students.

Criticisms. The MCMI has not supplanted the MMPI, and in the words of one reviewer "this carefully constructed test never received the attention it merited" (A. K. Hess, 1985, p. 984). In fact, A. K. Hess (1985) finds relatively little to criticize except that the MCMI's focus on psychopathology may lead the practitioner to overemphasize the pathological aspects of the client and not perceive the positive strengths a client may have. Other reviewers have not been so kind. Butcher and Owen (1978) point out that the use of base rates from Millon's normative sample will optimize accurate diagnosis only when the local base rates are identical. J. S. Wiggins (1982) criticized the MCMI for the high degree of item overlap. Widiger and Kelso (1983) indicated that such built-in interdependence does not allow one to use the MCMI to determine the relationship between disparate disorders. This is like asking "What's the relationship between X and Y?" If one uses a scale that correlates with both X and Y to measure X, the obtained results will be different than if one had used a scale that did not correlate with Y. Widiger, Williams, Spitzer, et al. (1985; 1986) questioned whether the MCMI is a valid measure of personality disorders as listed in the DSM, arguing that
Millon's description of specific personality styles was divergent from the DSM criteria. In fact, Widiger and Sanderson (1987) found poor convergent validity for those MCMI scales that were defined differently from the DSM, and poor discriminant validity because of item overlap. (For a review of standardized personality disorder measures such as the MCMI, see J. H. Reich, 1987; 1989; Widiger & Frances, 1987.) Nevertheless, the MCMI has become one of the most widely used clinical assessment instruments, has generated a considerable body of research literature, has been revised, and has been used in cross-cultural studies (R. I. Craig, 1999).

OTHER MEASURES

The Wisconsin Personality Disorders Inventory (WISPI)

The DSM has served as a guideline for a rather large number of tests, in addition to the MCMI, many focusing on specific syndromes, and some more broadly based. An example of the latter is the WISPI (M. H. Klein, et al., 1993), a relative newcomer, chosen here not because of its exceptional promise, but more to illustrate the difficulties of developing a well-functioning clinical instrument.

Development. Again, the first step was to develop an item pool that reflected DSM criteria and, in this case, reflected a particular theory of interpersonal behavior (L. S. Benjamin, 1993). One interesting approach used here was that the items were worded from the perspective of the respondent relating to others. For example, rather than having an item that says, "People say I am cold and aloof," the authors wrote, "When I have feelings I keep them to myself because others might use them against me." A total of 360 items were generated that covered the 11 personality-disorder categories, social desirability, and some other relevant dimensions. The authors do not indicate the procedures used to eliminate items, and indeed the impression one gets is that all items were retained. Respondents are asked to answer each item according to their "usual self" over the past 5 years or more, and use a 10-point scale (where 1 is never or not at all true, and 10 is always or extremely true).

Content validity. The items were given to 4 clinicians to sort into the 11 personality disorder categories. A variety of analyses were then carried out, basically showing clinicians' agreement. Where there was disagreement in the sorting of items, the disagreement was taken as reflecting the fact that several of the personality disorders overlap in symptomatology.

Normative sample. The major normative sample is composed of 1,230 subjects, which includes 368 patients and 862 normals who were recruited from the general population by newspaper advertisements, classroom visits, solicitation of visitors to the University Hospital, and so on. Although the authors give some standard demographic information such as gender, education, and age, there is little other information given; for example, where did the patients come from (hospitalized? outpatients? community clinic? university hospital?), and what is their diagnosis? Presumably, patients are in therapy, but what kind and at what stage is not given. Clearly these subjects are samples of convenience; the average age of the normal subjects is given as 24.4, which suggests a heavy percentage of captive college-aged students.

Reliability. Interitem consistency was calculated for each of the 11 scales; alpha coefficients range from a low of .84 to a high of .96, with an average of .90, in the normative sample. Test-retest coefficients for a sample of 40 patients and 40 nonpatients who were administered the WISPI twice within 2 weeks ranged from a low of .71 to a high of .94, with an average of .88. Two forms of the WISPI were used, one a paper-and-pencil form, the other a computer-interview version, with administration counterbalanced. The results suggest that the two forms are equally reliable.

Scale intercorrelations. The scales correlate substantially with each other, from a high of .82 (between the Histrionic and the Narcissistic Scales) to a low of .29 (between the Histrionic and the Schizoid Scales); the average intercorrelation is .62. This is a serious problem and the authors recognize this; they suggest various methods by which such intercorrelations can be lowered.
Concurrent validity. Do the WISPI scales discriminate between patients and nonpatients? You recall that this is the question of primary validity (see Chapter 3). Eight of the 11 scales do, but the Histrionic, Narcissistic, and Antisocial Scales do not. Here we must bring up the question of statistical vs. clinical significance. Take for example the Paranoid Scale, for which the authors report a mean of 3.5 for patients (n = 368) and a mean of 3.08 for nonpatients (n = 852). Given the large size of the samples, this rather small difference of .42 (which is a third of the SD) is statistically significant. But the authors do not provide an analysis of hits and errors that would give us information about the practical or clinical utility of this scale. If I use this scale as a clinician to make diagnostic decisions about patients, how often will I be making errors?

How well do the WISPI scales correlate with their counterparts on the MCMI? The average correlation is reported to be .39, and they range from −.26 (for the Compulsive Scale) to .68 (for the Dependent Scale). Note that, presumably, these two sets of scales are measuring the same dimensions and therefore ought to correlate substantially. We should not necessarily conclude at this point that the MCMI scales are "better," although it is tempting to do so. What would be needed is a comparison of the relative diagnostic efficiency of the two sets of scales against some nontest criterion. The WISPI is too new to evaluate properly, and only time will tell whether the test will languish in the dusty journal pages in the library or whether it will become a useful instrument for clinicians.

The Schizotypal Personality Questionnaire (SPQ)

In addition to the multivariate instruments such as the MMPI and the MCMI, there are specific scales that have been developed to assess particular conditions. One of the types of personality disorders listed in the DSM-III-R is that of schizotypal personality disorder. Individuals with this disorder exhibit a "pervasive pattern of peculiarities of ideation, appearance, and behavior," and show difficulties in interpersonal relations not quite as extreme as those shown by schizophrenics. There are nine diagnostic criteria given for this disorder, which include extreme anxiety in social situations, odd beliefs, eccentric behavior, and odd speech. A number of scales have been developed to assess this personality disorder, although most seem to focus on just a few of the nine criteria. An example of a relatively new and somewhat unknown scale that does cover all nine criteria is the SPQ (Raine, 1991). The SPQ is modeled on the DSM-III-R criteria, and thus the nine criteria served to provide a theoretical framework, a blueprint by which to generate items, and a source for items themselves. Raine (1991) first created a pool of 110 items, some taken from other scales, some paraphrasing the DSM criteria, and some created new. These items, using a true-false response, were administered to a sample of 302 undergraduate student volunteers, with the sample divided randomly into two subsamples for purposes of cross-validation. Subscores were obtained for each of the nine criterion areas and item-total correlations computed. Items were deleted if fewer than 10% of respondents endorsed them or if the item-total correlation was less than .15.
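As a concrete illustration of this kind of item screening, here is a minimal Python sketch applying Raine's two criteria. Whether his item-total correlations were corrected for item overlap is not stated; the corrected version is used here as an assumption.

    import numpy as np

    def screen_items(responses, min_endorsement=0.10, min_item_total=0.15):
        """responses: respondents x items array of 0/1 true-false answers.
        Returns the indices of items that survive both criteria."""
        responses = np.asarray(responses, dtype=float)
        total = responses.sum(axis=1)
        kept = []
        for j in range(responses.shape[1]):
            endorsed = responses[:, j].mean()
            # corrected item-total r: item vs. total of the *other* items
            r = np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
            if endorsed >= min_endorsement and r >= min_item_total:
                kept.append(j)
        return kept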
A final scale of 74 items, taking 5 to 10 minutes to complete, was thus developed. Table 7.5 lists the nine subscales or areas and an illustrative example of each.

Table 7–5. The Schizotypal Personality Questionnaire

Subscale                              Illustrative item
1. Ideas of reference                 People are talking about me.
2. Excessive social anxiety           I get nervous in a group.
3. Odd beliefs or magical thinking    I have had experiences with the supernatural.
4. Unusual perceptual experiences     When I look in the mirror my face changes.
5. Odd or eccentric behavior          People think I am strange.
6. No close friends                   I don't have close friends.
7. Odd speech                         I use words in unusual ways.
8. Constricted affect                 I keep my feelings to myself.
9. Suspiciousness                     I am often on my guard.

In addition to the pool of items, the subjects completed four other scales, two that were measures of schizotypal aspects and two that were not. This is of course a classical research design to obtain convergent and discriminant validity data (see Chapter 3). In addition, students who scored in the lowest and highest 10% of the distribution of scores were invited to be interviewed by doctoral students; the interviewers then independently assessed each of the 25 interviewees on the diagnosis of schizotypal disorder and on each of the nine dimensions.
Reliability. Coefficient alpha for the total score was computed as .90 and .91 in the two subsamples. Coefficient alpha for the nine subscales ranged from .71 to .78 for the final version. Note here somewhat of a paradox. The alpha values for each of the subscales are somewhat low, suggesting that each subscale is not fully homogeneous. When the nine subscales are united, we of course have both a longer test and a more heterogeneous test; one increases reliability, the other decreases internal consistency. The result, in this case, was that internal consistency was increased substantially.

For the 25 students who were interviewed, test-retest reliability with a 2-month interval was .82. Note that this is an inflated value because it is based on a sample composed of either high or low scores, and none in between. The greatest degree of intragroup variability occurs in the mid range rather than at the extremes.

Validity. Of the 11 subjects who were high scorers, 6 were in fact diagnosed as schizotypal; of the 14 low-scoring subjects, none were so diagnosed. When the SPQ subscores were compared with the ratings given by the interviewers, all correlations were statistically significant, ranging from a low of .55 to a high of .80. Unfortunately, only the coefficients for the same-named dimensions are given. For example, the Ideas of Reference Scale scores correlate .80 with the ratings of Ideas of Reference, but we don't know how they correlate with the other eight dimensions. For the entire student sample, convergent validity coefficients were .59 and .81, while discriminant validity coefficients were .19 and .37.

The State-Trait Anxiety Inventory (STAI)

Introduction. Originally, the STAI was developed as a research instrument to assess anxiety in normal adults, but it soon found usefulness with high-school students and with psychiatric and medical patients. The author of the test (Spielberger, 1966) distinguished between two kinds of anxiety. State anxiety is seen as a transitory emotional state characterized by subjective feelings of tension and apprehension, coupled with heightened autonomic nervous system activity. Trait anxiety refers to relatively stable individual differences in anxiety proneness; i.e., the tendency to respond to situations perceived as threatening with elevations in state anxiety intensity. People suffering from anxiety often appear nervous and apprehensive and typically complain of heart palpitations and of feeling faint; it is not unusual for them to sweat profusely and show rapid breathing.

Development. The STAI was developed beginning in 1964 through a series of steps and procedures somewhat too detailed to summarize here (see the STAI manual for details; Spielberger, Gorsuch, Lushene, et al., 1983). Initially, the intent was to develop a single scale that would measure both state and trait anxiety, but because of linguistic and other problems, it was eventually decided to develop different sets of items to measure state and trait anxiety.

Basically, three widely used anxiety scales were administered to a sample of college students. Items that showed correlations of at least .25 with each of the three anxiety scale total scores were selected and rewritten so that the item could be used with both state and trait instructions. Items were then administered to another sample of college students, and items that correlated at least .35 with total scores (under both sets of instructions designed to elicit state and trait responses) were retained. Finally, a number of steps and studies were undertaken that resulted in the present form of two sets of items that functioned differently under different types of instructional sets (e.g., "Make believe you are about to take an important final examination").

Description. The STAI consists of 40 statements, divided into 2 sections of 20 items each. For the state portion, the subject is asked to describe how he or she feels at the moment, using the four response options of not at all, somewhat, moderately so, and very much so. Typical state items are: "I feel calm" and "I feel anxious." For the trait portion, the subject is asked to describe how
he or she generally feels, using the four response options of almost never, sometimes, often, and almost always. Typical trait items are: "I am happy" and "I lack self-confidence." There are five items that occur on both scales, three of them with identical wording, and two slightly different.

Administration. The STAI can be administered individually or in a group, has no time limit, requires a fifth- to sixth-grade reading ability, and can typically be completed in less than 15 minutes. The two sets of items with their instructions are printed on opposite sides of a one-page test form. The actual questionnaire that the subject responds to is titled "Self-evaluation Questionnaire," and the term anxiety is not to be used. The state scale is answered first, followed by the trait scale.

Scoring. Scoring is typically by hand using templates, but one can use a machine-scored answer sheet. For the state scale, 10 of the items are scored on a 1 to 4 scale, depending upon the subject's response, and for 10 of the items the scoring is reversed, so that higher scores always reflect greater anxiety. For the trait scale, only seven of the items are reversed in scoring.
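A minimal sketch of this kind of scoring, assuming hypothetical item numbers for the reversed set (the actual STAI scoring key is given only in the manual):

    # Hypothetical reversed-item set; on the real state scale there are 10.
    REVERSED = {1, 2, 5, 8, 10, 11, 15, 16, 19, 20}

    def score_stai_scale(responses, reversed_items=REVERSED):
        """responses: dict mapping item number (1-20) to a rating of 1-4.
        Reversing a 1-4 rating maps 1<->4 and 2<->3, i.e., 5 - rating."""
        return sum((5 - r) if item in reversed_items else r
                   for item, r in responses.items())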
Reliability. The test manual indicates that internal consistency (alpha) coefficients range from .83 to .92 with various samples, and there seems to be no significant difference in reliability between the state and trait components. Test-retest coefficients are also given for various samples, with time periods of 1 hour, 20 days, and 104 days. For the state scale, the coefficients range from .16 to .54, with a median of about .32. For the trait scale, coefficients range from .73 to .86, with a median of about .76. For the state scale, the results seem inadequate, but the subjects in the 1-hour test-retest condition were exposed to different treatments, such as relaxation training, designed to change their state scores, and the very instructions reflect unique situational factors that exist at the time of testing. Thus for the state scale, a more appropriate judgment of its reliability is given by the internal consistency coefficients given above.

Validity. In large part, the construct validity of the STAI was assured by the procedures used in developing the measure. As we saw with the MCMI, this is as it should be, because validity should not be an afterthought but should be incorporated into the very genesis of a scale.

Concurrent validity is presented by correlations of the STAI trait score with three other measures of anxiety. These correlations range from a low of .41 to a high of .85, in general supporting the validity of the STAI. Note here somewhat of a "catch-22" situation. If a new scale of anxiety were to correlate in the mid to high .90s with an old scale, then clearly the new scale would simply be an alternate form of the old scale, and thus of limited usefulness.

Other validity studies are also reported in the STAI manual. In one study, college students were administered the STAI state scale under standard instructions (how do you feel at the moment), and then readministered the scale according to "How would you feel just prior to a final examination in an important course." For both males and females, total scores were considerably higher in the exam condition than in the standard condition, and only one of the 20 items failed to show a statistically significant response shift.

In another study, the STAI and the Personality Research Form (discussed in Chapter 4) were administered to a sample of college students seeking help at their Counseling Center for either vocational-educational problems or for emotional problems. The mean scores on the STAI were higher for those students with emotional problems. In addition, many of the correlations between STAI scores and PRF variables were significant, with the highest correlation of .51 between STAI trait scores and the Impulsivity Scale of the PRF for the clients with emotional problems. Interestingly, the STAI and the EPPS (another personality inventory discussed in Chapter 4) do not seem to correlate with each other. STAI scores are also significantly correlated with MMPI scores, some quite substantially – for example, an r of .81 between the STAI trait score and the MMPI Pt (Psychasthenia) score, and .57 between both the STAI trait and state scores and the MMPI depression scale.

In yet another study reported in the test manual, scores on the STAI trait scale were significantly correlated with scores on the Mooney Problem Checklist, which, as its title indicates, is a list of problems that individuals can experience in a wide variety of areas. Spielberger, Gorsuch,
and Lushene (1970) argue that if students have difficulties in academic work, it is important to determine the extent to which emotional problems contribute to those difficulties. For a sample of more than 1,200 college freshmen, their STAI scores did not correlate significantly with high-school GPA, scores on an achievement test, or scores on the SAT. Thus, for college students, STAI scores and academic achievement seem to be unrelated.

Norms. Normative data are given in the test manual for high-school and college samples, divided as to gender, and for psychiatric, medical, and prison samples. Raw scores can be located in the appropriate table, and both T scores and percentile ranks can be obtained directly.

Do state and trait correlate? The two scales do correlate, but the size of the correlation depends upon the specific situation under which the state scale is administered. Under standard conditions, that is, those prevailing for captive college students who participate in these studies, the correlations range from .44 to .55 for females, and .51 to .67 for males. This gender difference, which seems to be consistent, suggests that males who are high on trait anxiety are generally more prone to experience anxiety states than are their female counterparts. Smaller correlations are obtained when the state scale is administered under conditions that pose some psychological threat, such as potential loss of self-esteem or evaluation of personal adequacy, as in an exam. Even smaller correlations are obtained when the threat is a physical one, such as electric shock (Hodges & Spielberger, 1966).

The Beck Depression Inventory (BDI)

Introduction. Depression is often misdiagnosed or not recognized as such, yet it is a fairly prevalent condition affecting one in eight Americans. There is thus a practical need for a good measure of depression, and many such measures have been developed. The BDI is probably the most commonly used of these measures; it has been used in hundreds of studies (Steer, Beck, & Garrison, 1986), and it is the most frequently cited self-report measure of depression (Ponterotto, Pace, & Kavan, 1989). That this is so is somewhat surprising because this is one of the few popular instruments developed by fiat (see Chapter 4) and without regard to theoretical notions about the etiology of depression (A. T. Beck & Beamesderfer, 1974).

Description. The BDI consists of 21 multiple-choice items, each listing a particular manifestation of depression, followed by 4 self-evaluative statements listed in order of severity. For example, with regard to pessimism, the four statements and their scoring weights might be similar to: (0) I am not pessimistic, (1) I am pessimistic about the future, (2) I am pretty hopeless about the future, and (3) I am very hopeless about the future. Table 7.6 lists the 21 items, also called symptom-attitude categories.

Table 7–6. Symptom-Attitude Categories of the BDI

1. Mood                   8. Self-accusations            15. Work inhibitions
2. Pessimism              9. Suicidal wishes             16. Sleep disturbance
3. Sense of failure      10. Crying spells               17. Fatigability
4. Dissatisfaction       11. Irritability                18. Loss of appetite
5. Guilt                 12. Social withdrawal           19. Weight loss
6. Sense of punishment   13. Indecisiveness              20. Somatic preoccupation
7. Self-dislike          14. Distortion of body image    21. Loss of libido

These items were the result of the clinical insight of Beck and his colleagues, based upon years of observation and therapeutic work with depressed patients, as well as a thorough awareness of the psychiatric literature. The format of the BDI assumes that the number of symptoms increases with the severity of depression, that the more depressed an individual is, the more intense a particular symptom, and that the four choices for each item parallel a progression from nondepressed to mildly depressed, moderately depressed, and severely depressed. The items represent cognitive symptoms of depression, rather than affective (emotional) or somatic (physical) symptoms.
The BDI was intended for use with clinical populations such as psychiatric patients, and was originally designed to estimate the severity of depression and not necessarily to diagnose individuals as depressed or not. It rapidly became quite popular for both clinical and nonclinical samples, such as college students, to assess both the presence and degree of depression. In fact, there are probably three major ways in which the BDI is used: (1) to assess the intensity of depression in psychiatric patients, (2) to monitor how effective specific therapeutic regimens are, and (3) to assess depression in normal populations.

The BDI was originally developed in 1961 and was "revised" in 1978 (A. T. Beck, 1978). The number of items remained the same in both forms, but for the revision the number of alternatives for each item was standardized to four "Likert" type responses. A. T. Beck and Steer (1984) compared the 1961 and 1978 versions in two large samples of psychiatric patients and found that both forms had high degrees of internal consistency (alphas of .88 and .86) and similar patterns of item vs. total score correlations. Lightfoot and Oliver (1985) similarly compared the two forms in a sample of university students, and found the forms to be relatively comparable, with a correlation of .94 for the total scores on the two forms.

Administration. Initially, the BDI was administered by a trained interviewer who read aloud each item to the patient, while the patient followed on a copy of the scale. In effect then, the BDI began life as a structured interview. Currently, most BDIs are administered by having the patient read the items and circle the most representative option in each item; it is thus typically used as a self-report instrument, applicable to groups. In its original form, the BDI instructed the patient to respond in terms of how they were feeling "at the present time," even though a number of items required by their very nature a comparison of recent functioning vs. usual functioning, i.e., over an extended period of time. The most recent revision asks the respondent to consider how they were feeling over the past few weeks. Incidentally, one advantage of such self-rating procedures is that they involve the patient in the assessment and may thus be therapeutic.

Scoring. The BDI is typically hand scored, and the raw scores are used directly without any transformation. Total raw scores can range from 0 to 63, and are used to categorize four levels of depression: none to minimal (scores of 0 to 9); mild to moderate (scores of 10–18); moderate to severe (19–29); and severe (30–63). Note that while there is a fairly wide range of potential scores, individuals who are not depressed should score below 10. There is thus a floor effect (as opposed to a ceiling effect, when the range of high scores is limited), which means that the BDI ought not to be used with normal subjects, and that low scores may be indicative of the absence of depression but not of the presence of happiness (for a scale that attempts to measure both depression and happiness, see McGreal & Joseph, 1993).
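Because scoring is a simple sum with fixed cutoffs, it can be expressed in a few lines of Python; the response vector below is invented purely for illustration.

    def bdi_category(total):
        """Map a BDI total raw score (0-63) to the four levels above."""
        if total <= 9:
            return "none to minimal"
        if total <= 18:
            return "mild to moderate"
        if total <= 29:
            return "moderate to severe"
        return "severe"

    responses = [1, 0, 2, 1, 0, 0, 1, 2, 0, 1, 1,
                 0, 2, 1, 0, 1, 2, 0, 1, 0, 1]   # 21 items, each scored 0-3
    total = sum(responses)
    print(total, bdi_category(total))            # 17 mild to moderate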
Reliability. A. T. Beck (1978) reports the results of an item analysis based on 606 protocols, showing significant positive correlations between each item and the total score. A corrected split-half reliability of .93 was also reported for a sample of 97 subjects.

Test-retest reliability presents some problems for instruments such as the BDI. Too brief an interval would reflect memory rather than stability per se, and too long an interval would mirror possible changes that partly might be the result of therapeutic interventions, "remission," or more individual factors. A. T. Beck and Beamesderfer (1974) do report a test-retest study of 38 patients, retested with a mean interval of 4 weeks. At both test and retest, an assessment of depth of depression was independently made by a psychiatrist. The authors report that the changes in BDI scores paralleled the changes in the clinical ratings of depression, although no data are advanced for this assertion. Oliver and Burkham (1979) reported a test-retest r of .79 for a sample of college students retested over a 3-week interval. In general, test-retest reliability is higher in nonpsychiatric samples than in psychiatric samples, as one might expect, because psychiatric patients would be expected to show change on retesting due to intervening experiences, whether therapeutic or not; such experiences would not affect all patients equally.

Internal consistency reliability seems quite adequate; typical results are those of Lightfoot
and Oliver (1985), who reported a coefficient alpha of .87 for a sample of college students. In their review of 25 years of research on the BDI, A. T. Beck, Steer, and Garbin (1988) found that the internal consistency of the BDI ranged from .73 to .95, with a mean alpha value of .81 for nonpsychiatric samples and .86 for psychiatric samples.

Yet, it should be pointed out that a large majority of studies on the BDI do not report any information on reliability. Yin and Fan (2000) carried out a meta-analysis of BDI studies, and found that only 7.5% reported meaningful reliability information. They found that test-retest reliability is lower than internal consistency reliability, and that reliability estimates obtained from studies of substance addicts were lower than those from studies of normal subjects.

Validity. There is a voluminous body of literature on the BDI, most supportive of its validity. In the area of concurrent validity, most studies show correlations in the .55 to .65 range and above between BDI scores and clinicians' ratings of depression (e.g., Metcalfe & Goldman, 1965), and correlations in the .70s with other standardized measures of depression, such as the MMPI D scale (e.g., Nussbaum, Wittig, Hanlon, et al., 1963) and the Zung Self-rating Depression Scale, another well-known measure of depression (Zung, 1965).

With regard to content validity, recent versions of the DSM list nine diagnostic criteria for depression; the BDI covers six of these (P. W. Moran & Lambert, 1983).

A. T. Beck and Beamesderfer (1974) discuss the construct validity of the BDI by relating a series of studies designed to assess such hypotheses as "Are depressed patients more likely to have a negative self-image and more likely to have dreams characterized by masochistic content?" The results of these studies supported such hypotheses and the construct validity of the BDI.

The BDI has been used to differentiate psychiatric patients from normal individuals in both adult and adolescent populations and to differentiate levels of severity of depression. The BDI seems to be sensitive to changes in depression that result from medications and other therapeutic interventions. Scores on the BDI correlate with a variety of conditions, such as suicidal behavior, that might be hypothesized to be related.

Secondary validity. In our discussion of secondary validity (see Chapter 3), we saw that Gough (1965) suggested relating scores on a measure to "important" variables. A. T. Beck and Beamesderfer (1974) undertook just such an analysis and found a small but significant relationship of BDI scores with gender (females scoring higher), none with race, none with age (contrary to the popular belief that older patients are more likely to be depressed), a small but significant relationship with educational attainment (patients with lesser education tended to score higher), a "slight" (but presumably insignificant) correlation with vocabulary scores, and a significant but explainable negative correlation with social desirability, in that depressed patients do select "unfavorable" alternatives.

Cross-cultural studies. The BDI has been used in a substantial number of cross-cultural studies in a variety of countries, ranging from the former Czechoslovakia and Switzerland (A. T. Beck & Beamesderfer, 1974) to Iran (Tashakkori, Barefoot, & Mehryar, 1989) and Brazil (Gorenstein, Andrade, Filho, et al., 1999), and has been translated into a wide range of languages including Chinese, German, Korean, and Turkish (see Naughton & Wiklund, 1993, for a brief review of these studies). The results are supportive of its reliability and validity across various cultures, with some minor exceptions.

Short form. A. T. Beck and Beamesderfer (1974) discuss a brief form of the BDI composed of 13 items, for use by general practitioners and by researchers for the rapid screening of potentially depressed patients. The items were chosen on the basis of their correlation with the total scale and with clinicians' ratings of depression (A. T. Beck, Rial, & Rickels, 1974). The 13-item total score correlated .96 with the total on the standard form. Internal consistency of the short form has ranged from about .70 to .90 (e.g., Gould, 1982; Leahy, 1992; Vredenburg, Krames, & Flett, 1985).

Factor analysis. A variety of investigators have assessed BDI data using factor-analytic techniques, with results reflecting a variety of obtained factors. A. T. Beck and Beamesderfer (1974) report a number of these studies that range from 1 general factor of depression, to 3, 4,
and even 10 additional factors. Beck and Beamesderfer (1974) themselves obtained three factors that they labeled as "negative view," "physiological," and "physical withdrawal." Weckowicz, Muir, and Cropley (1967) also found three factors, labeled as "guilty depression," "retarded depression," and "somatic disturbance," corresponding to those factors found by other investigators in more general investigations of depression. Recent studies (e.g., Byrne & Baron, 1994) also have reported three factors, but with some cross-cultural differences, at least within French Canadian vs. English Canadian adolescents. Endler, Rutherford, and Denisoff (1999), however, reported two factors for a sample of Canadian students – a cognitive-affective dimension and a physiological dimension. Whether the BDI measures depression in a unidimensional global manner or whether the scale is composed of several replicable factors remains an open issue (e.g., Welch, Hall, & Walkey, 1990).

How high is high? As mentioned earlier, psychological measurement is usually relative, and raw scores are of themselves meaningless. Given the possible range of scores on the BDI from 0 to 63, what score indicates the presence of depression? There is no such specific score, since the meaning and usefulness of a specific or cutoff score depends upon a number of aspects. Here again the notions of decision theory (discussed in Chapter 3) are relevant. The usefulness of any particular cutoff score is a function of the relative frequencies of false positives and false negatives. For example, if we wish to minimize false positives (high scorers who are not depressed), and we are not concerned about false negatives (individuals who really are depressed but are not recognized as such by our test), then a "high" cutoff score of at least 21 should be used.

If we are using the BDI as a screening inventory to detect depression among psychiatric patients, A. T. Beck and Beamesderfer (1974) recommend a cutoff score of 13. For screening depression among medical patients, a score of 10 is recommended (Schwab, Bialow, Brown, et al., 1967). A score of 10 has also been used in studies of college students (e.g., Hammen, 1980; M. Zimmerman, 1986). Incidentally, M. Zimmerman (1986) found that out of 132 introductory psychology students, 43 scored 10 or more on the initial BDI, but 22 of the 43 scored below 10 upon retesting a week later.
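The decision-theory logic can be made concrete with a small sketch: given BDI scores for criterion groups of depressed and nondepressed individuals, one can choose the cutoff that minimizes a weighted sum of the two error types. The cost weights, like the data, are supplied by the user; nothing here is prescribed by Beck.

    import numpy as np

    def best_cutoff(scores_depressed, scores_not_depressed,
                    fn_cost=1.0, fp_cost=1.0):
        dep = np.asarray(scores_depressed)
        non = np.asarray(scores_not_depressed)
        best_c, best_loss = None, np.inf
        for c in range(64):               # BDI totals run from 0 to 63
            fn_rate = np.mean(dep < c)    # depressed scorers below the cutoff
            fp_rate = np.mean(non >= c)   # nondepressed scorers at or above it
            loss = fn_cost * fn_rate + fp_cost * fp_rate
            if loss < best_loss:
                best_c, best_loss = c, loss
        return best_c

Setting fp_cost much higher than fn_cost pushes the chosen cutoff upward, which is exactly the "high cutoff of at least 21" strategy described above.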
Why use the BDI? Why use an instrument like the BDI, or for that matter any test, rather than depend upon the professional judgment of a clinician? If we think back to the notion of standardization discussed in Chapter 3, then the answer will be obvious. Not all clinicians are highly experienced, and even those who are may in fact be inconsistent in their application of diagnostic criteria. The criteria themselves may be inadequately specified (Ward, Beck, Mendelson, et al., 1962). An instrument such as the BDI is well standardized, economical to use, not dependent on the interviewer's theoretical orientation or clinical sagacity, and yields a score that can be used to assess changes due to medications, psychotherapy, or other treatments.

On the other hand, comparisons between self-ratings and expert ratings often do not produce substantial correlations, as might be expected. For example, Kearns et al. (1982) compared five self-assessment measures of depression, including the BDI, with two interview-based measures. The authors concluded that the self-rating measures showed poor performance and suggested abandoning their use.

Criticisms. No instrument escapes criticism, and the BDI is no exception. Gotlib (1984) has argued that in college students the BDI is a measure of "general psychopathology" rather than just depression (cf. Hill, Kemp-Wheeler, & Jones, 1986). Other authors have pointed out that in its early version, the BDI was administered in a clinical context where both interviewer and patient agreed upon the aim of the interview, i.e., to obtain some factual information about the client's emotional problems. Thus any motivation to dissimulate would have been minimal. Subsequently, however, the BDI has been administered in group settings where the subjects, be they psychiatric patients or college students, may well have different motivational attitudes and may be more likely to distort their responses. The BDI has also been criticized for its inability to differentiate moderate from severe levels of depression (Bech, et al., 1975).

A somewhat different concern reflects the notion that responses may be in part a function
of the structural aspects of the test rather than the content of the items. Dahlstrom, Brooks, and Peterson (1990) administered three forms of the BDI to a sample of college women: the standard form, a backwards form where the response options were presented in reverse order from most pathological to least, and a random form where the response options were scrambled. The random-order BDI resulted in a significantly higher mean depression score (11.01) than either the standard form (7.93) or the backwards form (6.01). The authors concluded that the standard BDI response format is highly susceptible to a "position response" set, where either the first or the last option tends to be endorsed, rather than careful consideration being given to all four choices. They therefore recommended the use of the random form.

Content validity seems to be an issue also. The BDI items emphasize the subjective experience of depression, and it is estimated that only 29% of the BDI score reflects a physiological factor; other scales of depression seem to have a larger behavioral and somatic component and may therefore be more sensitive to changes in depression as a function of treatment (Lambert, Hatch, Kingston, et al., 1986).

Second edition. A second edition of the BDI was published in 1996 (A. T. Beck, Steer, & Brown, 1996) in part to increase its content validity – i.e., a criticism of the first edition was that the items did not fully cover the DSM diagnostic criteria for depression. A study of the BDI-II concluded that this version shows high internal reliability and factor validity (Dozois, Dobson, & Ahnberg, 1998).

Center for Epidemiologic Studies-Depression (CES-D)

Sometimes, rather than create a new scale, investigators look at the variety of measures that have been developed to assess a particular variable, and select the "best" items as a new scale. The CES-D illustrates this approach. This scale was designed to measure symptoms of depression in community populations (Radloff, 1977); it is basically a screening inventory, designed not as a diagnostic tool but as a broad assessment device. The scale consists of 20 items taken from other depression scales such as the MMPI D scale and the BDI; these are presumed to reflect the major symptoms of depression such as feelings of loneliness, hopelessness, sleep disturbance, and loss of appetite. The scale, however, does not fully match the DSM criteria for depression, does not distinguish between subtypes of depression, and does not include such symptoms as suicidal ideation. The scale can be self-administered, used as part of a clinical interview, and even as a telephone survey.

The respondent is asked to rate the frequency of each of the 20 symptoms over the past week, using one of four response categories, ranging from 0 (rarely or none of the time) to 3 (most or all of the time). Scores can thus range from 0 to 60. Four of the 20 items are worded in a positive direction (e.g., "I was happy"), and 16 are worded negatively (e.g., "I felt depressed").

Reliability. The CES-D is designed to measure current state, and the instructions request the respondent to consider only the past week. In addition, depression is considered to be "episodic," that is, the symptoms vary over time. Therefore, test-retest reliability is expected not to be very high – and indeed it is not. In the original study, test-retest intervals of 2 to 8 weeks produced average test-retest coefficients of .57; greater intervals produced lower coefficients, but shorter intervals did not produce higher coefficients.

Internal consistency measures, on the other hand, such as split-half and coefficient alpha, produced coefficients in the high .80s and low .90s (Radloff, 1977; Radloff & Teri, 1986).

Validity. The CES-D discriminates well between clinical patients and general population samples, as well as within various psychiatric diagnostic groups. Scores on the CES-D correlate well with ratings of severity of depression made by clinicians familiar with the patients, as well as with other measures of depression. Radloff and Teri (1986) reviewed studies using the CES-D with the elderly and concluded that the CES-D was as good a measure of depression in older adults as in younger adults. Both reliability and validity findings with the elderly were comparable with those obtained with younger samples. The scale has
been used with a wide variety of groups, including homeless persons (Wong, 2000).

The construct validity of the CES-D also seems quite acceptable, but there is some question as to whether the CES-D measures depression, both depression and anxiety, or some other variable. Roberts, Vernon, and Rhoades (1989) suggest, for example, that the scale measures "demoralization," which could be a precursor to either depression or anxiety.

Studies of the factor structure of the CES-D have typically found four factors, and although different investigators use different terms, the four factors reflect: (1) depressed affect, (2) positive affect, (3) somatic/vegetative signs, and (4) interpersonal distress – this last factor is composed of only two items and is a "weak" factor psychometrically (Kuo, 1984; Radloff, 1977). Four subscale scores can thus be obtained that include 18 of the 20 items.

Like the BDI, the CES-D has been translated into a number of languages including Chinese (Ying, 1988) and Greek (Madianos, Gournas, & Stefanis, 1992), and short forms have been developed (Melchior, Huba, Brown, et al., 1993; Santor & Coyne, 1997).

The Zung Self-Rating Depression Scale (SDS)

The SDS was designed to provide a quantitative, objective assessment of the subjective experience of depression (Zung, 1965). The scale is composed of 20 items that cover affective, cognitive, behavioral, and psychological symptoms of depression. Respondents are asked to rate each item using a 4-point scale from 1 (none or a little of the time) to 4 (most or all of the time), as to how it has applied to them during the past week. The SDS is self-administered and takes about 5 minutes to complete.

The score on the SDS is calculated by summing the item scores, dividing by 80, and multiplying by 100. Scores below 50 are in the normal range, scores between 50 and 59 reflect mild depression, 60 to 69 marked depression, and 70 or above extreme depression.
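That computation, with the cutoffs just given, can be sketched as follows; the assumption that positively worded items have already been reverse-scored is ours, since item keying is not addressed above.

    def sds_index(item_scores):
        """item_scores: the 20 SDS item ratings, each 1-4 (positively worded
        items assumed already reverse-scored). Returns the SDS index, i.e.,
        the raw total expressed as a percentage of the maximum score of 80."""
        assert len(item_scores) == 20 and all(1 <= s <= 4 for s in item_scores)
        index = sum(item_scores) / 80 * 100
        if index < 50:
            category = "normal range"
        elif index < 60:
            category = "mild depression"
        elif index < 70:
            category = "marked depression"
        else:
            category = "extreme depression"
        return index, category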
Although the SDS has been available for some time and has been used in a variety of studies, there is relatively little reliability information available; what is available suggests adequate reliability (Naughton & Wiklund, 1993). Studies of the validity of the SDS have in general been positive, with some dissent (e.g., Blumenthal, 1975; Hedlund & Vieweg, 1979). Hedlund and Vieweg (1979) concluded that the SDS could be used as a screening tool or as an ancillary measure, but not as a diagnostic measure of depression. As with the BDI, a number of studies have looked at cross-cultural applications of the SDS in various countries such as Finland, Germany, Iran, Italy, and Japan (e.g., deJonghe & Baneke, 1989; Horiguchi & Inami, 1991; Kivela & Pahkala, 1986; Naughton & Wiklund, 1993; Zung, 1969).

Usefulness of self-reports. There is considerable debate about the usefulness of self-report inventories such as the MMPI and the MCMI in the diagnosis and assessment of psychiatric disorders. A number of critics have argued that psychiatric patients, because of the nature of their illnesses, are basically untestable, that is, not able to complete the inventories in a valid manner (e.g., F. K. Goodwin & Jamison, 1990; Walters, 1988). Others have argued that inventories are useful and can provide valuable information (e.g., Bauer, et al., 1991; Wetzler, 1989b). Certainly, the empirical data support the usefulness of such tests as a major source of data for clinical assessment (e.g., Wetzler, Kahn, Strauman, et al., 1989; Wetzler & Marlowe, 1993).

A core battery. One important role for psychological tests in the area of psychopathology is to measure the efficacy of psychological treatments. Attempts have been made, especially in the assessment of anxiety disorders, personality disorders, as well as mood disorders, to delineate a core battery of tests that could be used by practitioners. This would allow comparisons of different techniques and different programs and would clarify the communication of such results to various agencies, to researchers, and even to the general public. For now, however, there is disagreement as to whether such uniformity is desirable and/or useful and what specific tests might form such a core battery.

Focal assessment. Wetzler (1989a) suggests that in the assessment of psychopathology there is a new movement termed focal assessment. Traditionally, a standard battery of projective and
intelligence tests was used to assess a psychiatric Elwood (1993) gives a clear example. He asks
patient. The battery often included the MMPI, us to imagine that a test to assess depression is
a WAIS, a Rorschach, and perhaps a TAT or given to 100 depressed patients and 100 normal
other projective instrument. Psychiatric rating controls. The results are as follows:
scales, structured interviews, or self-report scales
were usually relegated to research purposes. Focal Diagnosis
assessment involves the use of specialized instru- Not
ments, of instruments whose focus is much nar- Depressed depressed
rower than a broad-based inventory such as the Depressed 90 10
MMPI. This is in sharp contrast to the sugges- Test
tion made above of a broad-based battery to be results
Not 10 90
given to all patients. For now, most practitioners depressed
who use tests use a combination of the twosome,
broad-based instruments such the MMPI and/or
Using the test results, of the 100 depressed
Rorschach and some specific instruments that
patients we identify 90 correctly and misidentify
measure depression, post-traumatic stress disor-
10 (false negatives); of the 100 normal controls
der, or other condition.
we again identify 90 correctly and misidentify 10
(false positives). The predictive value for this test
is then 90/100 or 90%, a rather impressive result.
The Base Rate Problem Revisited Now imagine that the test is used to screen for
depression in a setting where the base rate for
In Chapter 3, we discussed the notion of base
depression is 10%; this means that for every 200
rates, the “naturally” occurring rate of some-
patients, 20 are in fact depressed. Administering
thing. For example, if out of 500 consecutively
the test to 200 patients would yield the following
admitted patients to a psychiatric hospital, 10
results:
were diagnosed as schizophrenic, the base rate for
schizophrenia in that particular setting would be Diagnosis
2% (10/500).
Not
In clinical psychology, the usefulness of a test Depressed depressed
is often judged by whether the test results can
Depressed 18 18
be used to classify a client as having a particular Test
diagnosis vs. some other diagnosis. This valid- results
ity is often established by comparing contrasted Not 2 162
groups – for example, depressed patients vs. non- depressed
depressed. Quite often, the two samples are of
equal size – a good research strategy designed Note that both sensitivity and specificity are
to maximize the statistical power of such pro- independent of base rate. So sensitivity stays at
cedures as analysis of variance and facilitating 90% and that is why for every 20 depressed
group matching on potentially confounding vari- patients, the test would correctly identify 18, with
ables such as age and gender. Note however, that 2 false negatives. Specificity also stays at 90%, so
when we assess whether the means of two groups of the 180 patients who are not depressed, 90%
are significantly different on Test X, the typical or 162 are true negatives and the other 18 are
procedures used, such a t tests or ANOVA, do false positive. Notice however, what happens to
not indicate whether the size of the difference the predictive power. It now becomes 18/36 (true
between two means is large enough for clinical positives/true positives + false positives) or 50%.
use. Such an answer is provided by the analysis In effect, we could get the same results by flipping
of false positives, false negatives, and hits or, as a coin! The solution, of course, is to calculate local
we indicated in Chapter 3, of sensitivity (the true base rates and pay attention to them, a point that
positive rate) and specificity (the true negative has been made repeatedly and forcefully (Meehl
rate). Unfortunately such an analysis is severely and Rosen, 1955) and, as we saw, was incorpo-
influenced by the base rate. rated in the MCMI.
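The arithmetic behind Elwood's example generalizes to any sensitivity, specificity, and base rate, and it is easy to check. Here is a minimal Python sketch (our illustration, not Elwood's; the function name is ours):

    def predictive_value(sensitivity, specificity, n_cases, n_normals):
        """Expected outcome counts and predictive value of a positive test."""
        true_pos = sensitivity * n_cases            # cases correctly flagged
        false_pos = (1 - specificity) * n_normals   # normals wrongly flagged
        return true_pos, false_pos, true_pos / (true_pos + false_pos)

    # Validation study: 100 depressed patients and 100 normal controls.
    print(predictive_value(0.90, 0.90, 100, 100))   # about 90, 10, 0.90

    # Screening setting: base rate 10%, so 20 cases among 200 patients.
    print(predictive_value(0.90, 0.90, 20, 180))    # about 18, 18, 0.50

The same test, with identical sensitivity and specificity, drops from 90% to 50% predictive value purely because the base rate changed.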
SUMMARY

We have taken a brief look at a variety of instruments, and through them, at issues that underlie the area of testing for psychopathology. We have seen screening inventories, multivariate tests, and measures of specific dimensions such as anxiety and depression. In the past, a major issue has always been criterion validity, and diagnosis was seen as not highly reliable, and therefore not a valid (even though necessary) criterion. As the focus has changed more to construct validity, the issue has focused more on the sensitivity and predictive power of a test. At the same time, we should not lose sight of the fact that a test, in the hands of a well-trained and sensitive clinician, is much more than a simple diagnostic tool.

SUGGESTED READINGS

Elwood, R. W. (1993). Psychological tests and clinical discriminations: Beginning to address the base rate problem. Clinical Psychology Review, 13, 409–419.
We used the arguments and examples given by this author in our discussion of base rates. This article is a very clear exposition of base rates as they affect clinical assessment.

Helmes, E., & Reddon, J. R. (1993). A perspective on developments in assessing psychopathology: A critical review of the MMPI and MMPI-2. Psychological Bulletin, 113, 453–471.
The authors perceive the MMPI to be an outmoded instrument and severely criticize it. Although one may not agree with all of the criticisms and the forcefulness with which they are stated, this is a well-written review article that focuses on major aspects of test construction.

Lambert, M. J., Hatch, D. R., Kingston, M. D., & Edwards, B. C. (1986). Zung, Beck, and Hamilton Rating Scales as measures of treatment outcome: A meta-analytic comparison. Journal of Consulting and Clinical Psychology, 54, 54–59.
There are three popular depression scales in the psychiatric literature: the Zung, the BDI, and the Hamilton. The first two, discussed in this chapter, are self-rating scales; the Hamilton is completed by the interviewer or clinician. The authors located 85 studies that compared at least two of the three measures and analyzed 36 of these studies through the technique of meta-analysis (explained quite clearly in this article).

Reynolds, W. M., & Kobak, K. A. (1995). Reliability and validity of the Hamilton Depression Inventory: A paper-and-pencil version of the Hamilton Depression Rating Scale clinical interview. Psychological Assessment, 7, 472–483.
A very thorough review of a paper-and-pencil version of a depression scale that originally was a semistructured interview measure.

Watson, C. G. (1990). Psychometric posttraumatic stress disorder measurement techniques: A review. Psychological Assessment, 2, 460–469.
A review of 12 measures of posttraumatic stress disorder, including a scale developed on the MMPI. The author covers in detail issues of reliability and validity, as well as the utility of such scales.

DISCUSSION QUESTIONS

1. If a test such as the SCL-90R is given to a patient, how do we know that the results are valid?
2. The president of the university decides that all new students are to be given the PSI. What are some of the ethical concerns of such a procedure?
3. Why do you think the MMPI continues to be used widely?
4. What are some of the "unique" aspects of the MCMI?
5. How would you differentiate between state and trait anxiety? Would the same distinction apply to other variables – for example, depression?
8 Normal Positive Functioning
AIM This chapter looks at a variety of areas that reflect normal positive functioning. The chapter is not intended to be a comprehensive review of normality; it covers a small number of selected areas chosen either because of their importance in psychological testing, or because of some illustrative innovative aspect, and perhaps because of our feeling that some of these areas, although important, are often neglected by instructors. Much of psychological testing has developed within a clinical tradition, with the emphasis on psychopathology. As we saw in Chapter 7, psychologists have developed some fairly sophisticated measures of psychopathology; even intelligence testing, covered in Chapter 5, developed originally within the context of assessing retarded children. The assessment of normality has in many ways been neglected, primarily because assessment occurs where there is a need – and the need to "measure" what is normal has not, in the past, been very strong. Keep in mind also that the dividing line between normality and abnormality is not absolute, and so tests of psychopathology such as the MMPI can also be used with presumably mentally healthy college students.
SELF-CONCEPT

Perhaps a first question about normal functioning has to do with a person's self-concept. How do you feel about yourself? Do you like yourself? Do you have confidence in your abilities? Do you perceive yourself as being of value and worth? Or are you doubtful about your own worth, do you have little confidence in yourself, and often feel unhappy about yourself? This is the issue of self-esteem and/or self-concept. A person's self-concept is that person's view of self, and it is highly related to a wide variety of behaviors. Other terms such as self-esteem and self-image are used, and authors argue about the differences and similarities. Some differentiate between various self-combinations, and others use all the terms synonymously. Paralleling the development of intelligence measurement, as well as the measurement of many other variables, self-concept research initially emphasized a general or unitary self-concept. Recently it has focused on the multidimensionality of the self-concept. For our purposes, we use the term "self-concept" to include both the global concept and specific dimensions.

The Tennessee Self-Concept Scale

One of the better-known scales of self-concept is the Tennessee Self-Concept Scale or TSCS (Fitts, 1965); a 1981 bibliography contained some 1,350 relevant references (P. F. Reed, Fitts, & Boehm, 1981). The TSCS consists of 100 self-descriptive items, such as "I am a happy person" and "I do what is right," by means of which an individual indicates what he or she likes, feels, does, and so on. The scale is basically designed to assess a person's self-image, and how realistic or deviant it is. Each item has five response options, ranging from "completely false" to "completely true."
FIGURE 8–1. Two-dimensional schema underlying the TSCS: five external aspects of the self (Physical self, Moral-Ethical self, Personal self, Family self, Social self) crossed with three internal aspects (Identity, Self-satisfaction, Behavior), with six items per cell.
Development. The TSCS was developed in 1955, and the first step was to compile a pool of self-descriptive items. These were obtained from other self-concept scales, as well as from written self-descriptions of mental-health patients and of nonpatients. These items were then classified into a two-dimensional schema, very much like the taxonomy approach used in content validity. Although the two dimensions are not named explicitly, other than "internal" and "external," one dimension consists of five external aspects of the self, such as physical self, moral-ethical self, and family self; the second dimension consists of three internal aspects of functioning, namely identity (what the person is), self-satisfaction (how the person accepts himself or herself), and behavior (how the person acts). Identity can be interpreted as the internal, private aspect of the self; behavior is the manifestation of the self that is observable to others; and satisfaction can be reframed as the discrepancy between the actual self and the ideal self. Figure 8.1 illustrates the schema.

Items were judged by seven clinical psychologists as to where they belonged in this schema, and whether they were positive or negative. The final 90 items retained for the scale are those where there was perfect agreement on the part of the judges; they are equally divided as to positive and negative items, with six items for each intersection of the two dimensions. An additional 10 items were "borrowed" from the L scale of the MMPI. Note that, in general, the development of this scale follows the pattern outlined in Chapter 2.

Description. The TSCS is self-administered, either individually or in groups, can be used with individuals 12 years or older, and requires about a sixth-grade reading level. Typically it takes about 15 minutes to complete the scale, which can then be scored by hand or machine. There are two forms of the scale, although both forms use the same test booklet and test items. The Counseling form is quicker and easier to score, and the results can be used by the client directly, while the Clinical/Research form is more complicated in terms of scoring, obtained results, analysis, and interpretation.

There is a reusable test booklet and a consumable answer sheet, as well as a consumable test booklet for use in computerized scoring. One of the "interesting" aspects of this test is that the items in the booklet are not listed in the usual numerical order. For example, item 1 is followed by items 3, 5, 19, 21, etc., and the answer sheet matches (somewhat) this numerical progression. This is done so the answer marks go through a carbon paper onto a scoring sheet in appropriate proximity for each scale. Otherwise, the items would either have to be rearranged, making their intent even more transparent, or the responses would need to be recopied, or many scoring templates would be needed. (There is of course a simpler solution, and that is to renumber the items in sequence.) Subjects are asked to indicate at what time they began the test and at what time they
finished. Although this is not a timed test, the amount of time is used as a score.

Scoring. The TSCS yields a rather large number of scores – 46 in total (although only 29 find their way onto the profile sheet) – and the impression one gets is that every possible combination of items is in fact scored! There is, for example, a "Self-Criticism" score, which is the sum of the 10 MMPI items. There is a "Total Positive" score, which is a sum of all 90 items and presumably reflects overall level of self-esteem. The Total Positive score is broken down into eight components, according to each of the eight categories that make up the two-dimensional schema. There is a "Variability" score (actually subdivided into three scores) that assesses the amount of variability or inconsistency from one area of self-perception to another. There are also six empirically derived scales, obtained through various comparisons of a normal group with psychiatric patients, as well as other scores too numerous to mention here. As you might imagine, hand scoring is quite time consuming, although the test manual is fairly clear in the scoring directions given.

Reliability. Test-retest reliability on a sample of 60 college students retested over a 2-week period yielded coefficients ranging from .60 to .92, with most of the coefficients in the acceptable range. The primary scale, the Total Positive, yielded an r of .92.

Validity. The test manual (Fitts, 1965) discusses four types of validity: (1) content validity, (2) discrimination between groups (i.e., concurrent validity), (3) correlation with other personality measures (i.e., criterion validity), and (4) personality changes under particular conditions (i.e., construct validity). Notice that the last two categories implicitly suggest that the TSCS is a personality test.

Content validity is incorrectly interpreted as interrater reliability. That is, Fitts (1965) argues that the TSCS has content validity because the judges agreed in their placement of the retained items, and therefore the scales of the TSCS are "logically meaningful and publicly communicable." As we learned in Chapter 3, the issue of content validity is whether the test adequately covers the variable to be assessed. The two-dimensional framework used to assign items is of course where the focus of content validity should be, and in this case we might well conclude that what is covered in this test is comprehensive but not exhaustive. For example, we might argue that "academic" self-concept should be included.

Concurrent validity is shown by comparison of TSCS scores for various groups – psychiatric patients, the normative group, and a group of "well-integrated" individuals. Other studies are cited that found significant differences on TSCS scores between delinquents and nondelinquents, between first offenders and repeat offenders, between unwed mothers and controls, and between alcoholics and controls.

Criterion validity data are presented based on studies of psychiatric patients with the MMPI, high-school students with the Edwards Personal Preference Schedule, and various other measures with college students and other samples. In general, although there are some noticeable exceptions, the pattern of correlations supports the criterion validity of the TSCS. Note, however, that basically the results would be what we would expect if the TSCS were a measure of general adjustment, psychological health, or a similar global variable. None of these data exclusively support the notion that the TSCS measures self-concept.

Finally, construct validity is addressed through a number of studies that hypothesize that positive experiences such as psychotherapy should result in an enhancement of self-concept, while negative experiences such as stress or failure result in lowered self-concept. For example, a study of paratrooper trainees is cited in which the trainees underwent physical danger as well as "attitude training," where failure was considered a disgrace. The trainees were administered the TSCS both before and after training, and some trainees passed while some failed the program. However, both pass and fail groups showed "significant score decreases," so it is moot whether these results support the validity of the TSCS – we could argue that the brutal training resulted in lowered self-esteem for all, or that the pass group should have shown increased self-concept.

Norms. The original norms were based on a sample of 626 individuals. Little information is given as to the characteristics of these people other than
to indicate that the sample is very heterogeneous in age, gender, socioeconomic level, and education. The sample is clearly not random or representative (as might be obtained through census data), so we must conclude that this is a sample of convenience. The author (Fitts, 1965) does argue that large samples from other populations do not differ appreciably from his norms – that is, the norms are representative (even though he indicates that the norm group has an excess of college students). He also argues that there is no need to establish separate norms by age, gender, race, or other variables. We of course would want more than simply the assurance of an author. We would want to look at actual score distributions for separate groups and assure ourselves that indeed they are identical; or, at the very least, we would want the results of a statistical test such as chi-square to show that the two distributions are not significantly different from each other.

Intercorrelations of scale scores. Because of item overlap among scales, and because some scales represent a subtotal of another scale, obtained correlations among scale scores are spuriously high. On the other hand, the major dimensions of self-perception (i.e., self-esteem, self-criticism, variability, certainty, and conflict) are all relatively independent of each other.

Criticisms. It would seem that self-concept would be a valuable dimension to study within the context of normal functioning, and indeed it is. It is somewhat curious, then, that much of the focus of the TSCS is based on clinical patients undergoing psychotherapy, who may be lacking in self-esteem or have distorted self-images. In fact, we could have easily discussed the TSCS in Chapter 4 under the topic of personality or in Chapter 7 under the topic of psychopathology.

Despite the fact that this is a commonly used self-concept scale, the TSCS has received substantial criticisms: it is open to social desirability and other response sets, the results of factor-analytic studies do not support its hypothesized dimensions, and the reliability data are considered inappropriate and inadequate (e.g., P. Bentler, 1972; Hoffman & Gellen, 1983; Wylie, 1974). Tzeng, Maxey, Fortier, et al. (1985), on the basis of several factor analyses, found that although reliability indices were "exceedingly high," there was no support for the factors postulated by Fitts, and at best there were only two to four dimensions in the TSCS. The conclusions were that scoring the TSCS according to directions would only "misguide the user" and lead to interpretations that "are simply not warranted," and that the TSCS is "clearly inadequate." Perhaps we can summarize by paraphrasing Marsh and Richards (1988), who indicated that in the 1960s the TSCS perhaps represented one of the best self-concept instruments but, as judged by current test standards, it is a weak instrument.

Primary, secondary, and tertiary validity revisited. In Chapter 3, we discussed a conceptual model to organize information about a specific test. You recall that the model had three steps that we called primary, secondary, and tertiary validity. This model is not used widely in the literature, despite the fact that it provides a very useful framework for the practitioner who is seriously interested in learning about and mastering a specific test. The topic of self-concept allows a useful illustration of this framework.

Years ago, I (G. Domino) carried out a study to assess the effectiveness of a television campaign designed to lower drug abuse among adolescents (G. Domino, 1982). In preparation for that study, we found a 50-item self-concept questionnaire, called the Self-Esteem Questionnaire (SEQ), in the appendix of a drug-education text. Neither the publisher nor the editor could provide any information whatsoever about the questionnaire. We nevertheless used it in our study and found it a relatively interesting measure. We therefore decided to do a series of programmatic studies to generate the type of information needed to evaluate the reliability and validity of this measure, using the tripartite conceptual model (G. Domino & Blumberg, 1987). Table 8.1 gives some illustrative examples of SEQ items.

Table 8–1. Illustrative SEQ Items
  I usually feel inferior to others.
  I normally feel warm and happy toward myself.
  I often feel inadequate to handle new situations.
  I usually feel warm and friendly toward all I contact.
  I habitually condemn myself for my mistakes and shortcomings.
  I am free of shame, blame, guilt, and remorse.
The first question is that of reliability, not part of the model because the model addresses only the issue of validity. For a sample of college psychology students, a test-retest with a 10-week interval yielded an r = .76. For a sample of high-school students, a split-half reliability yielded a corrected coefficient of .81, and an internal consistency analysis a Cronbach's alpha of .52. These results suggested a modicum of reliability, and a possible hypothesis that the instrument was not homogeneous.

Primary validity. Remember that the task here is to determine how well the test measures what it purports to measure – in this case self-esteem. A good beginning is to determine whether the mean scores for various groups, for whom we would theoretically expect a difference, are in fact different. If a test cannot differentiate at the group level, it certainly would not be very useful at the individual level. Six student groups were selected, ranging from student leaders, for whom self-esteem should be highest, to students on academic probation, for whom self-esteem should be lowest. The results are presented in Table 8.2. Note that the group means form a nice progression along what might be called a sociological continuum. That is, on the basis of sociological theory, we hypothesized that the groups should occupy different positions on this continuum, and indeed the results support this. Note that the means and SDs are expressed as T scores, with an expected mean of 50 and a SD of 10. The top three groups can be said to be above the mean on self-esteem, and the bottom three groups below the mean. A biserial correlation between scores on the SEQ and the dichotomy of higher versus lower status yielded an r = .59 (p < .001). Much more evidence would of course be needed to establish primary validity, but the above is a good beginning.

Table 8–2. Mean and SD T Scores for Six Groups Presumed to Differ on Self-Esteem
  Group                                       Mean    SD
  Student leaders                             62.3    5.7
  Varsity athletes                            57.1    5.0
  Intro Psych students                        51.4    7.3
  Counseling Center clients                   46.3    5.1
  "Problem" students as identified by Dean    43.4    4.8
  Students on academic probation              41.4    5.4
homogeneous.
accounted for 62% of the variance, a specific fac-
tor suggesting neurotic defensiveness (11% vari-
Primary validity. Remember that the task here is ance), and a third smaller factor relating to inter-
to determine how well the test measures what it personal competence. Thus a major limitation of
purports to measure – in this case self-esteem. A this inventory is its unknown conceptual under-
good beginning is to determine whether the mean pinnings and limited content validity.
scores for various groups for whom we would the- For step three, a series of studies indicated no
oretically expect a difference, are in fact different. gender differences, no ethnic or racial differences,
If a test cannot differentiate at the group level, it no significant correlations with socioeconomic
certainly would not be very useful at the individ- status or measures of social desirability, no sig-
ual level. Six student groups were selected, rang- nificant correlations with intelligence-test scores,
ing from student leaders for whom self-esteem but significant correlations with GPA in both col-
should be highest to students on academic pro- lege and high-school students. In addition, scores
bation, for whom self-esteem should be lowest. on this scale correlated significantly with scores
The results are presented in Table 8.2. Note that on six other self-concept measures, with corre-
the group means form a nice progression along lation coefficients ranging from .38 to .73. Of
what might be called a sociological continuum. course, this is the kind of data one would expect
That is, on the basis of sociological theory, we to find in a test manual or in the professional
hypothesized that the groups should occupy dif- literature.
ferent positions on this continuum, and indeed The fourth step is to look at high- and low-
the results support this. Note that the means and scoring individuals. That is, if Kathryn obtains
SDs are expressed as T scores, with an expected a T score of 65 on this scale, what else can
mean of 50 and a SD of 10. The top three groups we say about her, other than that she has
can be said to be above the mean on self-esteem, high self-esteem? To obtain data to answer this
and the bottom three groups below the mean. question, high-scoring and low-scoring students
A biserial correlation between scores on the were interviewed and the interviews observed by
SEQ and the dichotomy of higher versus lower 12 clinical-psychology graduate students. Both
status yielded an r = .59 (p < .001). Much interviewer and observer were blind as to the
more evidence would of course be needed to student’s score on the SEQ. At the end of the
interviews, the observers evaluated each of the students by completing an Adjective Checklist (a list of 300 words which the observer checks if descriptive of the client; see the section below on creativity) and doing a Q sort (sorting a set of descriptive statements according to the degree that they characterize the client; see Chapter 18). Table 8.3 indicates which adjectives and which Q-sort statements were used more frequently to characterize high self-esteem students and low self-esteem students.

Table 8–3. ACL and Q-Sort Items Descriptive of High vs. Low Self-Esteem

High self-esteem
  ACL items: Active, Adaptable, Ambitious, Assertive, Calm, Capable, Confident, Energetic, Friendly, Healthy, Humorous, Intelligent, Natural, Self-confident, Sociable, Talkative
  Q-sort items: Has a wide range of interests. Initiates humor. Is productive; gets things done. Is calm, relaxed in manner. Has insights into own motives and behavior. Feels satisfied with self. Has social poise and presence. Values own independence and autonomy.

Low self-esteem
  ACL items: Anxious, Awkward, Distractible, Immature, Inhibited, Interests narrow, Shy, Timid, Weak, Withdrawn
  Q-sort items: Tends to be self-defensive. Seeks reassurance from others. Judges self and others in conventional terms. Is basically anxious. Compares self to others. Does not vary roles.

Note that the portraits presented here of these two types of students are internally consistent, i.e., they make sense, they are coherent. High self-esteem subjects are perceived as confident and productive, as able to relate well interpersonally and to behave in a calm yet active manner. Low self-esteem subjects are seen as timid and conventional, with an almost neurotic need for reassurance.

Tertiary validity. This step involves the justification for using such a measure as the SEQ. For example, if you were not interested in assessing self-esteem, why pay attention to the SEQ? We could of course argue that self-esteem is such a basic variable that it probably is relevant to almost any psychological inquiry. Could the SEQ, however, provide some additional information beyond assessment of self-esteem? In a sample of alcoholics undergoing therapy, scores on the SEQ correlated with staff ratings of improvement; thus the SEQ might have some potential use in studies of the psychotherapeutic process. In another study, two samples of students were assessed. Both samples scored high on a battery of tests of creativity, but one sample showed evidence of actual creative achievement while the other did not. One of the significant differences between the two groups was that the productive students showed a higher mean on self-esteem. Thus the SEQ might be of interest to investigators concerned with creative achievement. Finally, in a sample of male adult professionals, SEQ scores were correlated with a measure of psychological femininity. Thus, although there seem to be no gender differences on this scale, the SEQ might be relevant to studies of androgyny and related aspects.

LOCUS OF CONTROL

One of the major themes of both human and animal behavior is that of control, that is, continued attempts to deal effectively with the environment. The experience of achieving mastery over oneself and surrounding circumstances is one of the most fundamental aspects of human experience. There is in fact a voluminous body of literature in psychology on this topic, and a number of experiments have become "classics" that are cited in introductory psychology textbooks. For example, Stotland and Blumenthal (1964) showed that humans made to feel in control in a testing situation tended to be less anxious than those who did not have this belief. Seligman (1975) showed that dogs exhibited "helpless" behavior when exposed to conditions they could not control. Rotter (1966) hypothesized that the degree to which a person perceives rewards to be contingent upon their own efforts vs. controlled by
others is an important dimension. He identified belief in internal control as the perception that rewards are contingent upon one's behavior, and external control as the perception that rewards are under the control of powerful others, of luck and chance, or unpredictable. This hypothesis was presented within the context of social-learning theory, where rewards or reinforcements act to strengthen an expectancy that a particular behavior will be followed by that reinforcement in the future. Thus internal-external control of reinforcement is a generalized expectancy that there is more or less a connection between behavior and the occurrence of rewards; this is a continuum rather than a dichotomy.

The Internal-External Locus of Control Scale

Development. The Internal-External Locus of Control (I-E) scale was thus developed by Rotter to operationalize his hypothesis. The I-E scale began as a set of 26 items, using a Likert-type response scale, and was developed on a priori grounds – that is, the items were written to reflect the theoretical literature and to be used as is. This scale was used and further refined in two doctoral dissertations by students of Rotter, was then expanded to 60 items, and then, through a series of studies, was reduced to a 29-item forced-choice scale that includes 6 filler items. The I-E scale thus presents the respondent with pairs of items, from which the respondent selects the one that he or she most believes in. For example: (a) Becoming a success is a matter of hard work, or (b) Getting a good job depends on being at the right place at the right time. For each pair, one statement represents an internal locus of control and the matching statement an external locus of control. The score is the total number of external choices.

Originally, the I-E scale was intended to provide subscale scores in a variety of life areas such as social interactions, political affairs, and academic achievement; however, the subscales were all yielding similar results, so it was decided to abandon this effort and measure a single, overall expectancy.

Reliability and validity. Although the details presented by Rotter (1966) on the development of the scale are somewhat sketchy, and one would need to consult the original doctoral dissertations for specific details, Rotter (1966) provides a rather extensive amount of information on the initial reliability and validity of the I-E scale. In addition, internal vs. external control of reinforcement, often referred to as "locus of control," is probably one of the most studied variables in psychology, and numerous scales of locus of control are available. In fact, the area is so prolific that there are many reviews (e.g., Joe, 1971; Lefcourt, 1966; Strickland, 1989; Throop & MacDonald, 1971) and entire books devoted to the topic (e.g., Lefcourt, 1976; Phares, 1976).

Rotter (1966) reported corrected split-half reliabilities of .65 for males and .79 for females, and Kuder-Richardson coefficients for various samples in the .69 to .76 range. Rotter (1966) felt that the nature of the scale (brief, forced-choice, and composed of items covering a variety of situations) resulted in underestimates of its internal consistency. Test-retest reliability in various samples, with 1- and 2-month intervals, ranged from .49 to .83.

Correlations with a measure of social desirability ranged from −.17 to −.35, with a median of −.22, and correlations with various measures of intelligence were essentially insignificant. Rotter (1966) also briefly reported two factor analyses, both of which suggested one general factor. A number of other studies are presented addressing the construct validity of the scale, such as correlations with story-completion and semistructured interview measures of locus of control, analyses of social-class differences, and controlled laboratory tasks.

The literature is replete with hundreds of studies that support the construct validity of the scale and of the concept. Locus of control scores are related to a wide variety of behaviors such as academic achievement (the more internal the orientation, the higher the achievement; e.g., Bar-Tal & Bar-Zohar, 1977) and various aspects of problem solving (e.g., Lefcourt, 1976). A number of studies support the hypothesis that internals show more initiative and effort in controlling both the physical environment and their own impulses (e.g., Joe, 1971).

Popularity of concept. The topic of locus of control has proven to be immensely popular, not only in the United States but also in a
cross-cultural context (e.g., Dyal, 1983; Furnham & Henry, 1980; Tyler, Dhawan, & Sinha, 1989; Zea & Tyler, 1994). The concept has been applied to a wide variety of endeavors ranging from beliefs about the afterlife (e.g., Berman & Hays, 1973), to educational settings (e.g., Weiner, 1980), to behavior in organizations (e.g., Spector, 1982), and even dental health (e.g., Ludenia & Dunham, 1983). The scale also gave birth to a large number of additional locus-of-control scales, some for children (e.g., DeMinzi, 1990; Nowicki & Strickland, 1973), some for particular diseases (e.g., Ferraro, Price, Desmond, et al., 1987), some multidimensional (e.g., Coan, Fairchild, & Dobyns, 1973; TerKuile, Linssen, & Spinhoven, 1993; K. A. Wallston, B. S. Wallston, & DeVellis, 1978), some for specific arenas of behavior (e.g., Spector, 1988; B. S. Wallston, K. A. Wallston, Kaplan, & Maides, 1976), some brief versions of other scales (e.g., Sapp & Harrod, 1993), and some competing with the Rotter (e.g., Nowicki & Duke, 1974). Levenson (1973; 1974) suggested that locus of control was a multidimensional concept and developed the Multidimensional Locus of Control questionnaire, composed of three subscales: Internality, Powerful Others, and Chance. This scale also proved quite popular. Although most of these scales have been developed on the basis of empirical and theoretical guidelines (discussed in Chapter 2), not all have. For example, J. M. Schneider and Parsons (1970) decided, on the basis of a logical analysis, that the Rotter I-E scale actually contained five subscales. They asked judges to sort the items into five unspecified categories and found high interrater agreement. The categories were then named "general luck or fate," "respect," "politics," "academics and leadership," and "success."

Criticisms. When the I-E scale was first developed, a typical approach was to administer the scale to a sample and use a median split to obtain groups that would be called internals and externals. At that time, the median (or mean) for college-student samples was typically around 8. In recent studies, the mean has increased by about 0.5 to 1 SD (the typical SD runs around 4), to a median of 10 to 12. This means, in effect, that a score of 9 might have been considered an external score in earlier research, but an internal score in recent research.

Given the proliferation of locus-of-control scales, one may well question whether they are all measuring the same variable. Furnham (1987) administered seven such scales to a sample of British adolescents. Although a content analysis of the scales indicated very little overlap – i.e., a look at the items indicated that the scales used different items – the correlations between five of the scales (all measuring locus of control in children) were highly significant, and nearly all greater than .50. Their reliabilities (alpha coefficients) were low and ranged from .33 to .60. However, Furnham (1987) correctly interpreted these results not as lack of reliability, but as reflective of the multidimensionality of the scales.

SEXUALITY

As Wiederman and Allgeier (1993) state, "Sexuality is a vital part of being human." Sexuality covers a rather wide variety of variables, such as premarital intercourse, attitudes toward virginity, machismo, pornography, medical conditions, religious and philosophical views, homosexuality, and so on. All types of scales have been developed in this area, ranging from attitude scales toward the use of condoms (e.g., I. S. Brown, 1984) and masturbation (e.g., Abramson & Mosher, 1975) to sexual knowledge questionnaires (e.g., Gough, 1974; Moracco & Zeidan, 1982). From a psychometric point of view, this is a very broad and ill-defined area, and so the examples below cannot be taken as representative in the same way that the Stanford-Binet and the Wechsler tests "represent" intelligence tests.

The sexuality scale. Snell and Papini (1989) developed the Sexuality Scale (SS) to measure what people think and how they feel about their own sexuality. The SS consists of three subscales labeled sexual-esteem (e.g., "I am good at sex"), sexual-depression (e.g., "I feel depressed about my sex life"), and sexual-preoccupation (e.g., "I think about sex constantly"). Sexual-esteem was conceptualized as the capacity to experience one's sexuality in a satisfying and enjoyable way. Sexual-depression reflects a tendency to feel depressed or discouraged about one's capability to relate sexually to another person. Sexual-preoccupation is the persistent tendency
to become absorbed and even obsessed with sexual matters.

Ten items were originally written for each of the three subscales and administered to a sample of undergraduate college students. Respondents indicated degree of agreement with each item on a 5-point scale, where agree equaled +2 and disagree equaled −2. Half of the items in each subscale were reverse keyed. A factor analysis indicated that the three-factor model was reasonable, but two of the sexual-depression items were eliminated because of small factor loadings.

Validity. Snell and Papini (1989) found significantly higher levels of sexual-preoccupation for men than for women, but no gender differences in sexual-esteem or sexual-depression, while Wiederman and Allgeier (1993) found men to score higher on both sexual-esteem and sexual-preoccupation. A basic question that can be asked here is whether each of the subscales measures a variable related to sexuality or related to more general functioning – that is, is the sexual-depression scale a measure of more global depression? In their study, Wiederman and Allgeier (1993) administered the SS scale, the Rosenberg Self-Esteem scale, and the Beck Depression Inventory to a sample of undergraduate college students. They applied a special type of factor analysis and reduced the 30-item scale to 15 items; that is, they constructed a short form that was more reliable and that correlated highly with the original subscales. They found that global self-esteem was only moderately correlated with sexual-esteem, and that depression was only moderately correlated with sexual-depression.

A Guttman scale. In Chapter 6, we briefly discussed Guttman scales as a way of measuring attitudes. You recall that these are unidimensional scales created in such a way that a person's position on the psychological continuum being measured is represented by the point at which the responses of the individual shift from one category to another (for example, from agree to disagree). Guttman scales are relatively rare, but they have found particular application in the area of sexuality because sexual behavior and sexual intimacy seem to follow a particular progression. One such scale is that of "intimacy-permissiveness" developed by Christensen and Carpenter (1962). These investigators were interested in exploring the relationship of premarital pregnancy to possible consequences such as having to get married, giving the baby up for adoption, etc., as a function of cultural differences, specifically the "sexually restrictive" Mormon culture of Utah, the more typical United States culture, and the sexually permissive culture of Denmark.

The authors attempted to develop a Guttman-type scale to assess "intimacy permissiveness," that is, how permissive or restrictive a person's attitudes are toward premarital sexual intimacy. They began with 21 items but eventually reduced these to 10, after some statistical procedures were carried out to test for unidimensionality. The 10 items cover the desirability of marrying a virgin, petting, premarital intercourse, premarital pregnancy, and freedom of access to erotic literature. Presumably someone who has a permissive attitude toward premarital intercourse would also have a permissive attitude toward premarital petting. The subject is required to check each item with which he or she agrees, so scores can go from 0 to 10, with higher scores indicating greater permissiveness.

You recall that in Guttman scaling, the coefficient of reproducibility is the important statistical procedure. The authors report such coefficients as ranging from .90 to .96 (these are basically reliability coefficients). In terms of validity, the authors present three lines of evidence. First, the mean for the Danish sample (8.3) is substantially higher than the mean for the U.S. sample (4.1) and the mean for the Mormon sample (2.4). Second, males have higher mean scores than females in all three samples. Finally, the higher the intimacy-permissiveness score, the larger the percentage having premarital intercourse, again in all three samples. Unfortunately, the authors give no detail on how the original items were developed or on the precise statistical procedures used.
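Because the authors do not report their procedures, the computation below is only a generic illustration: a minimal Python sketch of one common way to obtain a coefficient of reproducibility, counting deviations from each respondent's ideal Guttman pattern. The response data are invented, and this is not necessarily the procedure Christensen and Carpenter used:

    def reproducibility(patterns):
        """Guttman coefficient of reproducibility for 0/1 response patterns
        whose items are ordered from most endorsed to least endorsed."""
        errors = 0
        total = 0
        for p in patterns:
            s = sum(p)                              # respondent's scale score
            ideal = [1] * s + [0] * (len(p) - s)    # perfect scale type
            errors += sum(a != b for a, b in zip(p, ideal))
            total += len(p)
        return 1 - errors / total

    # Three invented respondents on a 5-item permissiveness scale.
    data = [[1, 1, 1, 0, 0],    # perfect pattern, score 3
            [1, 1, 0, 1, 0],    # out-of-order endorsement: 2 errors
            [1, 0, 0, 0, 0]]    # perfect pattern, score 1
    print(reproducibility(data))   # about 0.87; .90 is the usual minimum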
CREATIVITY

Creativity has been of interest to psychologists for quite some time, but a serious effort to study creativity and to develop measures of creativity did not begin until the 1950s, in the work of Guilford at the University of Southern California and the work of various psychologists at the
Institute of Personality Assessment and Research of the University of California at Berkeley. There is at present no clear consensus as to precisely what creativity is and what its specific components are. It is therefore not too surprising that different investigators use different measures and that measures of "creativity," aiming at different subsets of abilities, often do not correlate highly with each other.

Torrance (1966), one of the leading researchers in this area, defines creativity as "a process of becoming sensitive to problems, deficiencies, gaps in knowledge, missing elements, disharmonies, and so on; identifying the difficulty; searching for solutions, making guesses, or formulating hypotheses about the deficiencies; testing and retesting these hypotheses and possibly modifying and retesting them; and finally communicating the results."

It is customary to distinguish between intelligence and creativity, and this theoretical distinction is mirrored in the respective tests. Intelligence tests tend to require convergent thinking, that is, coming up with the best single answer, while tests of creativity tend to require divergent thinking, that is, multiple answers, all of which are "uniquely correct." Thus, tests of intelligence often make use of vocabulary items, facility in solving mathematical problems, reading comprehension, and spatial visualization. Tests of creativity typically require imagination, generation of ideas, asking unusual questions, and coming up with novel responses.

Different perspectives. There are probably at least four ways in which we can study creativity. The first is to focus on creative persons. Suppose we were able to identify a group of highly creative individuals; we could then ask whether they differed on intelligence, personality, motivation, food preferences, and so on, from their less creative peers. Psychologists have been quite prolific in their study of such persons, and a wide variety of groups have been assessed, such as writers, architects, mathematicians, mothers of creative adolescents, and so on. The results suggest that creative persons do differ from their less creative peers in a number of significant ways that transcend field of enterprise (see Barron, 1969; G. A. Davis, 1986).

A second perspective is to study the creative process, that is, what happens when that inner light bulb goes on, what happens as a painter creates a painting or a poet writes a poem? The creative process is typically dissected into four stages: (1) preparation, where information is gathered, various solutions are attempted, and the challenge may be rephrased; (2) incubation, or a turning away from the problem – the person might go for a walk, take a bath, or sleep on it; (3) illumination, the "aha" experience where solutions emerge, often quite complete, detailed, and visual; and (4) verification, where the potential solutions are tested and elaborated. This four-stage process was proposed by Wallas (1926) and still seems to be quite applicable, although others have proposed more elaborate versions (e.g., Rossman, 1931).

A third approach is to focus on the creative product itself. What distinguishes creative from pedestrian paintings? Are there certain qualities of balance, of asymmetry, of form and motion, that are part-and-parcel of a creative product? Much of the work in this area has been done by artists, art critics, philosophers, and educators, rather than psychologists.

Finally, there is a fourth perspective that we might label press (to keep our alliteration). Creative press refers to the press or force of the environment on creativity – both the inner psychological environment and the outer physical environment. Here we might be concerned about what motivates a person to create or how the physical environment can promote or dampen creativity.

Torrance Test of Creative Thinking (TTCT)

There are hundreds of measures of creativity that have been proposed in the literature, but most have been presented without the required evidence for their basic reliability and validity. One instrument that has proven quite popular, and in many ways represents a reliable and valid approach, is the TTCT, which is actually a battery of tests (somewhat like the Wechsler tests), containing seven verbal subtests and three figural subtests.

The TTCT was developed by Torrance in 1966, and was intended to measure "creative thinking abilities" rather than creativity. The verbal
subtests include an "unusual uses" test ("think of unusual uses for a box"), a "guessing causes" subtest based on a picture, for which the examinee generates questions about cause-and-effect relationships that are not explicitly expressed in the picture, and a "product improvement" subtest, where the respondent generates ideas as to how to improve a product such as a toy. The figural subtests ask the subject to construct a picture from given materials, to complete some doodles, and to create drawings from parallel lines.

The TTCT was originally presented as a research instrument, and Torrance (1966) specified five potential uses for the battery: (1) studies of education in order to yield a more "human" education; (2) studies designed to discover effective bases for individualized instruction, where the indications are that more creative children prefer to learn by discovery, experimentation, and manipulation; (3) use of the tests for remedial and psychotherapeutic programs – for example, studies of children with learning disabilities; (4) studies of the differential results of specific educational interventions and/or techniques; and (5) use of tests to become aware of potentialities that might go unnoticed – for example, identifying gifted minority children.

The battery is intended for children, although it has been used with adolescents and adults; the subtests are gamelike and designed to catch a child's interest. It is doubtful that most college students and adults would respond with involvement, let alone enthusiasm, to many of the items. There are 5- to 10-minute time limits on the subtests. Note that all of the tasks, even the figural ones, require writing or drawing. Speed is essential but artistic quality is not.

Scoring. Hand scoring the TTCT is tedious and requires a well-trained individual, knowledgeable and experienced with the scoring guidelines. The test manual and scoring guides are quite clear and provide much guidance. There is also the possibility of sending the test protocols to the author for scoring (for a fee).

The subtests can be scored along four dimensions: fluency, flexibility, originality, and elaboration (not every subtest yields all four scores). These dimensions apply to many other creativity measures, especially the Guilford tests, where they originated as a framework. Fluency is often translated into the number of acceptable responses given. Flexibility reflects the number of categories of responses. For example, the word "ball" can be defined as a round/oblong object, as in baseball or football, but it can also be a formal dance or a colloquial expression for having fun. From a fluency point of view, baseball and football would count as two responses, while from the standpoint of flexibility they would count as one. Originality is often translated as statistical infrequency – any response given by fewer than 5 of 100 (or some other ratio) respondents is termed original. Finally, elaboration attempts to assess the amount of detail included in the response.
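To make these definitions concrete, here is a minimal Python sketch of how fluency, flexibility, and originality could be tallied for a single "unusual uses" item. The responses, category labels, and norm-group frequencies are all invented for illustration; they are not TTCT scoring materials:

    # Each invented response is paired with a category label and the
    # proportion of a norm group that gave that response.
    responses = [
        ("storage container",     "container", 0.55),
        ("house for a cat",       "shelter",   0.20),
        ("sled for a grass hill", "vehicle",   0.04),
        ("cut into stencils",     "material",  0.03),
    ]

    fluency = len(responses)                                    # 4 responses
    flexibility = len({cat for _, cat, _ in responses})         # 4 categories
    originality = sum(freq < 0.05 for _, _, freq in responses)  # 2 rare ones

    print(fluency, flexibility, originality)   # 4 4 2

Had two of the uses fallen into the same category, flexibility would drop to 3 while fluency stayed at 4, which is exactly the distinction drawn in the "ball" example above.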
Reliability. Treffinger (1985) reported that test-retest reliability ranges from .50 to .93, with most coefficients in the .60s and .70s. These are marginal figures that suggest the TTCT should not be used for individual decisions, but seem adequate for research or group purposes.

Validity. Treffinger (1985) indicates that scores on the TTCT are positively related to other concurrent criteria, including teacher ratings and observed creative-leadership activities, but that predictive validity is a much more complex and controversial matter. TTCT scores have a modest but significant correlation with later creative-achievement criteria (e.g., Torrance, 1981).

In the test manual, Torrance (1966) presents a wide variety of studies covering construct, concurrent, and predictive validity. Many more such studies have been published in the literature since then, and a majority are supportive of the validity of the TTCT. As far as content validity, Torrance (1966) argued that although the TTCT tasks do not sample the entire universe of creative abilities, they do sample a rather wide range of such abilities. The test stimuli were selected on the basis of an analysis of the literature regarding eminently creative individuals and educational theories regarding learning and creativity.

The manual also presents a number of findings that address the issue of whether TTCT scores are correlated with intelligence. The results seem to suggest that there is a very low pattern of correlations with intelligence tests (in the .20s), a slightly higher correlation with tests that assess reading
P1: JZP
0521861810c08 CB1038/Domino 0 521 86181 0 February 24, 2006 14:26

208 Part Two. Dimensions of Testing

and/or language (mid .20s to mid .30s), but that


such patterns also seem to be a function of the
characteristics of the sample tested.

Norms. Various group norms are presented by


Torrance (1966), ranging from children in pub-
lic schools to freshmen in a junior college nursing
program. Norms are given according to the three
or four dimensions (fluency, flexibility, original-
ity, and elaboration), and not for the specific sub-
tests or a total score. Most, if not all, of the samples
are samples of convenience, heterogeneous, and
somewhat small (fewer than 100). The norms are
in many cases somewhat puzzling. For example,
on the fluency dimension, a group of seniors in
nursing obtains a mean higher by more than
1 SD than a group of male arts college sopho-
mores. Within any one scoring dimension, such
as elaboration, the SD can be larger by a factor of
3 from group to group (e.g., 10.5 vs. 34.3).
Criticisms. Chase (1985) suggests that the construct validity of the TTCT is weak. In fact, the TTCT was originally intended for research purposes and its focus is clearly on what might be termed "scientific thinking," i.e., developing hypotheses, testing these, and communicating the results (E. Cooper, 1991). Other criticisms have ranged from its low reliability to its poor graphics.

Scores on the various subtests do intercorrelate substantially, with most of the coefficients in the .30 to .50 range. In fact, within the verbal domain, scores on fluency, flexibility, and originality correlate in the high .70s.

Guilford's Tests

When we discussed tests of cognitive abilities in Chapter 5, we mentioned Guilford's structure-of-intellect model. Briefly, you recall that in this model intellectual functioning involves the application of processes to contents, and this results in products. There were five types of processes, including convergent (thinking) production and divergent thinking (see Guilford, 1988, for a revision). Convergent thinking involves problems that have one correct answer: "How much will eight oranges cost if the price is two oranges for 25 cents?" "Who was the ninth president of the United States?" "What is the present capital of Greenland?" and so on. Divergent thinking, on the other hand, involves problems that have many possible solutions; although some solutions may be more cost effective, aesthetically pleasing, or reasonable than others, there is no one "correct" answer. For example, if you were given $100 to decorate your dorm room, what might you do? Guilford's structure-of-intellect model can be easily represented by a cube, as is done in Figure 8.2.

FIGURE 8–2. Guilford's structure-of-intellect model. Based on Guilford (1967). The shaded area, divergent production, relates to creativity. From Guilford, 1967. Copyright © 1967 by McGraw-Hill. Reproduced by permission of the publisher.

Note then, that Guilford's model encompasses both convergent and divergent thinking (Guilford, 1967b). Guilford and his collaborators attempted to develop measures that would assess each of the 120 dimensions of his model, and a number of the measures used to assess divergent thinking became popular as measures of creativity. Figure 8.3 illustrates the "slice" of the cube concerned with divergent thinking and names five of Guilford's tests that have been used in studies of creativity.
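Because the model is a complete crossing of three facets, its 120 cells can be enumerated mechanically, as the brief sketch below shows. The facet labels follow Guilford's customary terminology, but they are included here only to illustrate the combinatorial structure of the model.

```python
from itertools import product

operations = ["cognition", "memory", "divergent production",
              "convergent production", "evaluation"]           # 5 processes
contents   = ["figural", "symbolic", "semantic", "behavioral"] # 4 contents
products   = ["units", "classes", "relations", "systems",
              "transformations", "implications"]               # 6 products

cells = list(product(operations, contents, products))
print(len(cells))   # 120 hypothesized abilities, one per cell of the cube

# The Alternate Uses Test discussed below occupies one such cell:
assert ("divergent production", "semantic", "classes") in cells
```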
Note that the five examples given in Figure 8.3 are not distributed equally across contents and products. Some of the specific cells in Guilford's model have been difficult and/or impossible to translate operationally into reliable
FIGURE 8–3. Some of Guilford's tests of divergent thinking and their location in the structure-of-intellect model.
1. Consequences – Subject lists consequences of a given event, such as people having three legs.
2. Alternate Uses – Subject states uses for objects.
3. Making Objects – Given simple geometric forms, such as a circle and a rectangle, the subject constructs an object.
4. Match Problems – A configuration made up of short lines (match sticks) is given, and the subject is instructed to remove a specific number of matches to create a new configuration.
5. Possible Jobs – Given an emblem (for example, depicting a light bulb), the subject is to name occupations that might be represented by that emblem.

and valid measures. In Chapter 5, we mentioned the SOI-LA test as based on Guilford's structure-of-intellect model. You might recall that 3 of the 26 subtests of the SOI-LA assess three dimensions of creative thinking: fluency, flexibility, and originality.

The Alternate Uses Test (AUT)

The AUT (R. C. Wilson, Christensen, Merrifield, & Guilford, 1960), earlier called the Unusual Uses Test, involves the naming of a common object, such as a newspaper, and listing a maximum of six uses for that common object – uses that must be different from each other and from the primary use of that object. Thus a newspaper could be used to make up a kidnap note, to line the bottom of drawers, to make a sculpture of papier-mache, or to make a child's play hat. As is characteristic of most of Guilford's tests of divergent thinking, there are several items to the test (nine in the case of the AUT) with a very brief time period (for the AUT, 4 minutes for each part composed of 3 items). If you look at the Guilford model as represented in Figure 8.2, you can locate the AUT at the intersection of divergent production, classes, and semantic content. The test then is presumably a measure of a hypothesized factor of flexibility of thinking.

Scoring. The score is the number of acceptable responses (i.e., fluency), with credit given for no more than six responses per item. Thus the maximum score is 54 (9 items). The question, of course, is what is an acceptable response? In the test manual, the authors provide some guidelines (e.g., an acceptable use should be possible; vague or very general uses are not acceptable), as well as examples of acceptable and unacceptable responses. Thus, the question of interrater reliability is particularly applicable here.

Reliability. The reliability of Guilford's tests of divergent thinking is typically marginal. For example, for the AUT the manual cites ranges of reliability from .62 to .85, although it is not indicated how such reliability was obtained (test-retest?). Interrater reliability is not mentioned in the AUT manual; indeed the term does not appear in the subject index of Guilford's The Nature of Human Intelligence (1967b), where these tests are discussed.

Validity. The types of validity we discussed in Chapter 3 are not of interest to Guilford. Rather, he is concerned about factorial validity. Does a particular measure adequately represent or assess a particular factor dimension? The test manual reports that the factor loadings (very much like correlation coefficients) of the AUT on a factor of spontaneous flexibility have been .51 and .52 for adult samples, and .32 to .45 in samples of ninth graders. In the adult samples, AUT scores also had significant loadings with a factor of originality and a factor of "sensitivity to problems." Tests such as the AUT have actually been used in many studies as de facto valid measures of creativity, but the results suggest that their validity as judged by more traditional methods leaves much to be desired (see E. Cooper, 1991, for a critical review).

The Adjective Check List (ACL)

In one sense, the ACL should have been considered in Chapter 4 because it is primarily a personality inventory, but it is presented in this chapter because its beginnings and its use have
been closely interwoven with the field of creativity. The ACL is a very simple device and consists of 300 words, in alphabetical order, from absent-minded to zany. Subjects are typically asked to mark those adjectives that are self-descriptive. As with other personality inventories, the responses of the subject can then be translated into a personality profile over a number of scales.

The ACL actually began its professional life as an observer's checklist used to describe well-functioning subjects, such as architects and writers, who were being intensively studied at the Institute of Personality Assessment and Research (IPAR). The author of the ACL, Harrison Gough, published the 300-item version in 1952. Scales were then developed on the ACL, including a set of 15 scales based on Murray's need system, scales based on Transactional Analysis, and others. A test manual was published in 1965, and revised in 1983 (Gough & Heilbrun, 1983). The ACL has been quite popular, and has been translated into a substantial number of languages, from French to Vietnamese. The ACL can be computer or hand scored. Scoring basically involves subtracting the number of contraindicative items from the number of indicative items for each scale. Raw scores are then converted to T scores, according to both gender and the total number of adjectives checked.
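A schematic rendering of that scoring procedure follows. The miniature item sets are invented, and the norm parameters are stand-ins: the real conversion uses the manual's gender-specific tables keyed to the total number of adjectives checked, not a single formula.

```python
def acl_raw_score(checked: set, indicative: set, contraindicative: set) -> int:
    """Raw score = indicative adjectives checked minus contraindicative ones."""
    return len(checked & indicative) - len(checked & contraindicative)

def to_t_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    """Place a raw score on the T-score metric (mean 50, SD 10)."""
    return 50 + 10 * (raw - norm_mean) / norm_sd

# A hypothetical miniature scale and one respondent's protocol
indicative = {"imaginative", "inventive", "original"}
contraindicative = {"conventional", "cautious"}
checked = {"imaginative", "original", "cautious", "zany"}

raw = acl_raw_score(checked, indicative, contraindicative)   # 2 - 1 = 1
print(raw, round(to_t_score(raw, norm_mean=0.5, norm_sd=1.2), 1))
```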
In some ways, the ACL represents an ideal psychometric instrument. It is simple to administer, brief, nonthreatening and noninvasive to the respondent, scorable either by hand or computer, amenable to statistical and logical analyses, useful both as an observer- or self-descriptive instrument, and almost limitless in its range of applications (see Fekken, 1984, for a review). Because it is an open system such as the MMPI and CPI, new scales can be developed as needed, and in fact a number of investigators, including Gough himself, have developed scales of creativity on the ACL (G. Domino, 1994).

The Domino Creativity (Cr) Scale on the ACL. The Domino Cr scale on the ACL was developed by asking the faculty of an all-male liberal arts college, at the end of the academic year, to identify all freshmen who had shown creative abilities (see G. Domino, 1970, for more specific details). A total of 96 students were nominated, and these were matched with a control group as to gender (all males), age (modal age of 18), IQ (means of 128 and 130 respectively), degree of adjustment (as judged by the MMPI profile), and major (all liberal arts).

At the beginning of the second academic year, different faculty members were given names of students in their classes and asked to make a special effort to "observe" these students. At the end of the semester, the faculty were requested to identify each student as creative or not. The observed students were of course the creative nominees and their matched controls. Of the 96 creative nominees, 13 had left the college; of the 83 remaining, 62 were again identified as creative. Of the 96 controls, 6 had left the college and 3 were identified as creative; these were eliminated and a control group of 87 was retained.

At the beginning of the third academic year the same procedure was repeated, but this time after the faculty had observed the specified students for a semester, they were asked to describe each student on the ACL. Of the 300 words, 59 were used more frequently to describe creative students, and these became the Creativity scale. The scale includes such expected items as "artistic," "imaginative," and "inventive" (the ACL does not include the word "creative"), as well as items like "aloof," "argumentative," "dissatisfied," "intolerant," and "outspoken." Globally, the psychometric portrait presented by these items is quite consonant with both empirical findings and theoretical expectations about creative individuals.

Is there any evidence that the students observed by the faculty were indeed creative? One line of evidence is that they were independently nominated as creative by two distinct groups of faculty members. The author (G. Domino, 1970) also compared the creative and control students on three measures of creativity and found that creatives scored higher on all three measures.

Schaefer and Anastasi (1968; Schaefer, 1967) had studied a group of 800 high-school students evenly divided as to creative or control, male vs. female, and field of creativity (science vs. art). These students had filled out a self-descriptive ACL, so their protocols were scored for the Domino Cr scale. The results are given in Table 8.4 in T scores.

In each comparison, the ACL Cr scale statistically differentiated between creatives and controls, but showed no significant differences between gender and field of study. Note also, that although the scale was originally developed
on the basis of observer ratings of male college students, it was cross-validated on self-ratings of both male and female high-school students. Other studies of inventors, of dance and music students, of scientists and architecture students, have supported the validity of the scale (Albaum & Baker, 1977; Alter, 1984; 1989; G. Domino, 1994). The available literature (see G. Domino, 1994) supports the construct and criterion validity of this scale.

Table 8–4. T Scores for 800 High-School Students on the ACL Creativity Scale

Males                   M       SD
Artistic creative       54.48   8.71
Artistic control        45.52   8.94
Scientific creative     54.38   9.72
Scientific control      45.62   7.95

Females                 M       SD
Artistic creative       52.37   9.60
Artistic control        47.63   9.62
Literary creative*      54.14   9.66
Literary control        45.86   8.28

* There were not sufficient numbers of females studying science; hence for females, the fields of study were artistic and literary.

Some reliability information, primarily of the internal stability type, is available. For example, G. A. Davis and Bull (1978) reported a coefficient of .91 and Ironson and G. A. Davis (1979) of .90 for college students; G. Domino (1994) reported .81 and .86 for samples of scientists and architecture students.

The MBTI revisited. In Chapter 4, we covered the Myers-Briggs Type Indicator. Although the MBTI is a personality inventory, several of its scales are empirically and theoretically related to creativity, and in fact much of the early validational work on the MBTI was carried out at IPAR. In Chapter 3, we mentioned multiple regression as a method of combining test scores, and it might be worthwhile to revisit these two topics at this point.

On the MBTI there is a creativity index that can be calculated using the following regression equation (A. Thorne & Gough, 1991):

MBTI Creativity Index = 3SN + JP − EI − .5TF

Thus, we administer the MBTI to a subject, score the inventory on its four scales, and place the raw scores in the above equation. According to I. B. Myers and McCaulley (1985), the median score on this index is about 300 (SD = 96.8), with mean scores for various samples ranging from 221 to 365. Scores of 350 or higher are supposedly indicative of creative potential. Fleenor and Taylor (1994) indeed found that scores on this creativity index were substantially correlated to two other "personality type" self-report measures of creativity.
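As a worked example, the index is just a weighted sum of the four continuous scores. In the sketch below the function name and the sample scores are ours; the weights and the interpretive anchors (a median near 300, a cutoff of 350) are those reported above.

```python
def mbti_creativity_index(sn: float, jp: float, ei: float, tf: float) -> float:
    """A. Thorne & Gough's (1991) regression equation, 3SN + JP - EI - .5TF,
    applied to the four continuous MBTI scores."""
    return 3 * sn + jp - ei - 0.5 * tf

# Hypothetical continuous scores for one respondent
index = mbti_creativity_index(sn=120, jp=110, ei=95, tf=100)
print(index)   # 325.0, near the reported median of about 300; 350 or
               # higher is said to suggest creative potential
```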

Chinese Tangrams

There are a substantial number of creativity measures available (see the Journal of Creative Behavior, which publishes lists of these measures at irregular intervals), but few exist that are equally applicable to children as well as adults, that do not involve extensive verbal skills, and that require the subject to produce a potentially creative product without needing technical or specialized skills.

G. Domino (1980) chose a somewhat popular puzzle known as Chinese Tangrams, and explored its feasibility as a test of creativity. The tangram consists of a square, usually made of paper, cut into seven pieces, as illustrated in Figure 8.4.

FIGURE 8–4. The Chinese Tangram puzzle.

Tangrams can be used for three major activities: (1) to reproduce a given completed figure using all seven pieces; (2) to solve combinatorial problems, such as the number of different convex polygons that can be generated; and (3) to create "original" pictures, the focus of this effort
FIGURE 8–5. Illustrative responses on the Chinese Tangram: (a) a person; (b) a fox; (c) a rocket; (d) going south.
(M. Gardner, 1974). The tangram is presented as a game and its use illustrated by the examiner. Each person is given a tangram and is shown how the seven pieces can be put together to make a person, an animal, an object, or a more abstract idea, as illustrated in Figure 8.5. The examiner then points to the fox and asks what else might that be. Recognizable responses (e.g., a coyote, a skunk, a poodle) are praised. If not indicated by anyone, the examiner suggests that the fox might also be a smiling or jumping cat. The intent is to imply that the seven pieces can result in many configurations, including anthropomorphic and animated ones. Subjects are then asked to create their own picture, something creative and original that perhaps no one else will think of, but that others will recognize. The results can be scored on fluency (number of acceptable responses), flexibility (number of response categories), and originality (statistical infrequency).

Interrater reliability ranged from .72 to .96 in a sample of third graders, and from .76 to .92 in a sample of college students. In one study of third graders, children in a gifted class were compared with children in the standard classroom; gifted children scored higher on all three dimensions. In another study, 57 college students were administered the tangram and a battery of creativity measures, including two of Guilford's tests. Of the 15 correlation coefficients, 9 were statistically significant. In addition, scores on the tangram test were not correlated with measures of intelligence or academic achievement (see G. Domino, 1980, for details on all these studies).

General Comments About Creativity Tests

Interscorer and intrascorer reliability. With most tests of creativity, scoring the test protocol requires a fair amount of subjective judgment,
and the degree to which specific guidelines, examples, and/or training are provided varies widely. Therefore, with most creativity tests the question of interscorer reliability is important. Would two different individuals scoring a set of protocols independently assign identical (or highly similar) scores? The answer seems to be yes, and typical correlation coefficients can be in the .90s range, provided the scorers focus on the task seriously and are given clear guidelines and/or explicit training.

Given that scoring a test protocol is not the most exciting activity, we need to ask whether the scorer's reliability changes over time as a function of fatigue, boredom, or other aspects. The answer here also seems to be yes, but intrarater stability can be enhanced by not making the task too onerous – e.g., allowing coffee breaks, using aids to summarize scoring rules, keeping the number of protocols scored at one sitting to a minimum, and so on.

In this context, it should be pointed out that interrater reliability in psychological testing seems to be superior to that found in other fields. For example, R. L. Goldman (1992; 1994) reviewed the medical literature on peer assessment of quality of care (the degree to which different physicians agree in their assessment of medical records), and found that such reliability was only slightly higher than the level expected by chance.

Some scoring problems. Conventional scoring methods for many of the tests of creativity present a number of problems. Consider for example the assessment of originality, which seems to be a crucial component of creativity. To assess whether a response is original or not, we have two basic options. One is to collect responses from a large group of individuals (i.e., norms). These responses are tabulated, and any infrequent response, for example one given by fewer than 5 out of 100 respondents, is classified as original. If we now are scoring a new sample of protocols, we can use the tabulation as a guideline. We will need to make a number of subjective decisions. For example, if our test item is to name round objects that bounce, is a "bowling ball" acceptable or not? Is a "pelota" different from a ball?
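The tabulation option can be made concrete with a short sketch. The sample data below are invented, and the 5% cutoff follows the example just given; note that the table only mechanizes the frequency count, while the subjective equivalence decisions remain.

```python
from collections import Counter

def originality_norms(sample_responses, cutoff=0.05):
    """Tabulate a normative sample (one response per respondent here);
    a response counts as original if fewer than cutoff (e.g., 5%) gave it."""
    freq = Counter(sample_responses)
    n = len(sample_responses)
    return {resp: count / n < cutoff for resp, count in freq.items()}

# Invented norms for "name round objects that bounce" (100 respondents)
sample = ["ball"] * 70 + ["balloon"] * 20 + ["bowling ball"] * 7 + ["pelota"] * 3
norms = originality_norms(sample)
print(norms["pelota"], norms["ball"])   # True False
# Responses absent from the table still require the subjective decisions
# discussed above (is a "pelota" different from a "ball"?).
```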
A second method is to have raters, for example school teachers, rate the response as original or not. Hocevar (1979) argues that such subjective judgment is better than statistical infrequency as a criterion of originality. Runco and Mraz (1992) argue that both of these methods are unrealistic in that ideas are not produced singly; what is of interest is the ideational potential of the subject. They therefore asked a number of judges to assess a subject's responses in toto, as a unit. They also asked judges to rate "creativity" rather than originality. The results indicated that these ratings had high interrater reliabilities, but unfortunately poor discriminant validity in that ratings of creativity and ratings of intelligence correlated substantially (r = .58).

The issues here represent basic questions that need to be asked of any psychological test. To what degree is the scoring of the items objective or subjective? Is the scoring one that assesses individual items or one based on some global assessment? If we score individual items, how can these be placed together to maximize the correlation with some outside criterion? If we use global judgment, how can we make sure that the judgment is not contaminated by extraneous variables?

Measures of creativity have been questioned from a variety of viewpoints. One basic question is whether performance on a test of creativity is related to real-world creativity. Another question is whether performance is a function of the particular stimuli and instructions given to the person; several studies have shown that changes in test instructions result in changes in test performance, such as asking subjects to answer as if they were highly creative (e.g., Lissitz & Willhoft, 1985; S. V. Owen & Baum, 1985). Some have questioned whether the four divergent thinking dimensions of fluency, flexibility, originality, and elaboration are separate aspects of creativity.

IMAGERY

One area that is closely related to creativity is that of imagery, the ability to purposely visualize in one's mind objects that are not present. Because we are not able, at present, to get into someone else's mind, most measures of imagery are self-report questionnaires. Galton (1907) was probably the first psychologist to investigate imagery through a self-report questionnaire, by asking
subjects to imagine their breakfast table that morning and noting how vivid that image was.

Marks's Vividness of Visual Imagery Questionnaire (VVIQ)

Today probably the most commonly used measure of imagery is Marks' (1973) Vividness of Visual Imagery Questionnaire (VVIQ), although it has been severely criticized (e.g., Chara & Verplanck, 1986; Kaufman, 1983). This 16-item questionnaire asks the respondent to think, in turn, of a relative or friend seen frequently, a rising sun, a shop, and a country scene, and in each case visualize the picture that comes in the "mind's eye." For each item, the respondent uses a 1 to 5 rating scale, where 1 is defined as "perfectly clear and as vivid as normal vision" and 5 is "no image at all, you only 'know' that you are thinking of the object." The scale is administered twice, once with eyes open and once with eyes closed.

D. F. Marks (1973) reports a test-retest reliability of .74 for a sample of 68 subjects, with no time period indicated, and a split-half reliability of .85. He also reports three different studies, two with college students and one with high-school students, in which in each case the most extreme high scorers were compared with the most extreme low scorers in the degree of recall of colored photographs. In all three experiments, "good" visualizers (i.e., low scorers on the VVIQ) were more accurate in recall than "poor" visualizers (i.e., high scorers on the VVIQ). In two of the studies, D. F. Marks (1973) found that females had greater accuracy of recall than males.

Paivio's Individual Differences Questionnaire (IDQ)

Another relatively popular imagery questionnaire is the Individual Differences Questionnaire (IDQ). Paivio (1975) proposed that human thinking involves a continuous interplay of nonverbal imagery and verbal symbolic processes, which are interconnected but functionally distinct; this proposal led to a "dual coding" theory that spells out the implications of such a model. As part of his efforts, Paivio (1971; Paivio & Harshman, 1983) developed the IDQ to measure imaginal and verbal traits. Basically, Paivio hypothesized that many situations and tasks can be conceptualized either verbally or nonverbally, and that individual persons differ in the extent to which their thinking uses one or another of these modalities. True-false items were then developed, based on the above assumption and on the more formal "dual coding" theory. The items generated (e.g., "I am a slow reader," "I enjoy learning new words," "I use mental pictures quite often") covered various preferences, abilities, and habits. Paivio developed a sample set of items and then asked several graduate students to generate more. The final set of 86 items contained an approximately equal number of items referring to the two symbolic modes, and approximately equal numbers of each type keyed true or false. Apparently, the inclusion or exclusion of an item was based on logical grounds and not statistical analyses. The IDQ yields two scores, one on the "imaginal" and one on the "verbal" scale. The actual items, instructions, and scoring key can be found in Paivio and Harshman (1983).

In a series of factor analyses on data from 100 college students, both two factors and six factors were obtained. The two factors seem to parallel the imaginal and the verbal dimensions. The six factors were labeled: (1) good verbal expression and fluency (e.g., I express my thoughts clearly); (2) habitual use of imagery (e.g., I often use mental images); (3) concern with correct use of words (e.g., I am very careful about using words correctly); (4) self-reported reading difficulties (e.g., I am a slow reader); (5) use of images to solve problems (e.g., I use images to solve problems); and (6) vividness of dreams, daydreams, and imagination (e.g., my dreams are quite vivid). Note that of the six factors, three are verbal and three reflect imagery. The six factors, however, cover only 47 of the 86 items, and two of the imagery factors are composed of two and four items respectively, so their practical utility is quite limited.

Alpha reliability coefficients for the six factor scales range from a low of .72 to a high of .84 (Paivio & Harshman, 1983). In the initial study, no validity data are presented; the focus is clearly on factor analysis. Subsequent studies have used factor 2, the habitual use of imagery, as a scale, with positive results (e.g., B. H. Cohen & Saslona, 1990; Hiscock, 1978).
Hiscock (1978) investigated the IDQ in a series of four studies in which a revised version of the IDQ was compared with other measures of imagery. Cronbach's alpha for the imagery scale was found to be .80, .81, and .87 in three different samples, while for the verbal scale, the obtained values were .83, .86, and .88. Retest reliability, based on a sample of 79 college students retested 2 to 6 weeks later, was .84 for the imagery scale and .88 for the verbal scale. Correlations between the IDQ scales and other measures of imagery were modest at best, ranging from a high of .49 and .56 to several nonsignificant values.

Kardash, Amlund, and Stock (1986) used 34 of the 86 items that met certain statistical criteria and did a factor analysis. Of the 34 items, 27 were verbal and 7 were imagery items. The 27-item verbal scale had an internal consistency reliability of r = .74, while for the 7-item imaginal scale, the r = .52. The authors concluded that the IDQ is indeed a multifactor instrument that needs improvement but can be interpreted on the basis of either two factors or five factors (the items on the factor "use of images to solve problems" were dropped from consideration because of statistical reasons).

The Verbalizer-Visualizer Questionnaire (VVQ)

A third somewhat popular measure of imagery is the VVQ, developed by A. Richardson (1977) to measure individual differences on a verbal-visual dimension of cognitive style. The 15 items on the scale were selected empirically to discriminate between individuals who habitually move their eyes to the left when thinking vs. eyes to the right, because such eye movements are presumably linked to hemispheric brain functions. The items on the VVQ ask whether the subject has good facility with words, whether they think in images, whether their dreams and daydreams are vivid, and so on. The author himself cross-validated the results (A. Richardson, 1978) only to find results that were in the opposite direction from what was expected!

Edward and Wilkins (1981) conducted two studies designed to explore the relationship of the VVQ with measures of imagery and of verbal-visual ability. The results, particularly of the second study, cast considerable doubt on the construct validity of the VVQ. Part of the difficulty may be that the VVQ considers verbal-visual abilities as opposite ends of a continuum, while the literature suggests that these two processes are parallel and independent.

Parrott (1986) also found the VVQ not to relate to visual imagery, spatial-ability level, and examination grades in a sample of mechanical engineering students. Boswell and Pickett (1991) suggested that the problem with the VVQ was its questionable internal consistency. They administered the VVQ to a sample of college students and computed K-R 20 coefficients; these were .42 for males, .50 for females, and .47 for the total sample. They carried out a factor analysis and found six factors, but only the first two were logically meaningful; these were labeled as "vividness of dreams" and "enjoyment of word usage."

General concern about imagery questionnaires. Unfortunately, there seems to be a lack of consistent correlation between subjective reports of visual imagery and actual performance on tasks that seem to depend on such visual imagery. Part of the problem may well be the complicated nature of visual imagery, and part of the problem seems to be the assessment devices used in such studies. It may well be that questionnaires such as those by Marks and by Paivio are useful and predictive only under certain circumstances – for example, only when the spontaneous use of such imagery is helpful (e.g., R. G. Turner, 1978). On the other hand, at least one review of such measures concluded that they are reliable and have predictive validity (K. White, Sheehan, & Ashton, 1977).

COMPETITIVENESS

Competitiveness is a salient characteristic of everyday life, particularly in American culture. There are many perspectives one can use in assessing competitiveness, and Smither and Houston (1992) identified four major ones: achievement motivation, sports psychology, experimental social psychology, and personality assessment. Achievement motivation was historically tested by projective techniques, in particular the Thematic Apperception Test, but, subsequently, a large number of objective inventories were developed. A good example is the CPI
discussed in Chapter 4, which contains two scales related to achievement motivation. Smither and Houston (1992) argue that competitiveness and achievement motivation are not necessarily the same; they may occur in the same individual, but competitiveness need not be present in a highly achieving person. Competitive behavior is of course central to sports, and a number of questionnaires have been designed to assess competitiveness in the context of sports (e.g., Fabian & Ross, 1984; Gill & Deeter, 1988). In the area of experimental social psychology, competitiveness has been studied extensively by looking at how valued rewards are distributed in a group, or by having subjects engage in games in which they can choose to cooperate or compete. G. Domino (1992), for example, compared Chinese and American children on a game-like task and found that Chinese children were more likely to engage in cooperative responses that maximized group rewards, while American children were more likely to engage in competitive responses that maximized individual gain. Finally, competitiveness can be viewed from the perspective of personality; it may be seen as a pervasive quality present not only in sports but in most human activities.

Smither and Houston (1992) argued that in a review of the literature, they found no measures of competitiveness that were independent of achievement motivation, generalizable rather than focused on athletics, and psychometrically sound. They therefore set out to develop such a measure. They generated a set of 67 items designed to identify persons who prefer competitive over cooperative situations (e.g., "I am a competitive person") and administered these items and a scale of achievement motivation to a sample of 84 working adults. Scores on the two scales correlated .68. An item analysis correlating each item to the total score yielded 41 items with significant correlations. It is not clear exactly what the authors did here because they state that the 41 items "discriminated significantly between high and low scorers," but they indicate that the items were dropped or retained on the basis of the item-total correlation (rather than a contrasted-group approach suggested by the quote). They carried out an item analysis on a second, larger sample of undergraduate students and retained 20 items based on item-total correlations larger than .40. The internal consistency of this 20-item form was .90. Scores on this scale, which the authors call the Competitiveness Index, were significantly correlated with measures of achievement motivation. A factor analysis yielded three factors labeled as "emotion," "argument," and "games." Whether this competitiveness index is useful remains to be seen.

HOPE

Hope is of particular interest to the fields of health psychology, psychological medicine, and nursing since, anecdotally at least, there seems to be evidence that hope and the course of illnesses such as cancer are intimately related (e.g., J. Dufault & Martocchio, 1985). There is evidence to suggest links between hope and health (e.g., Gottschalk, 1985; Kiecolt-Glaser & Glaser, 1986), hope and psychotherapeutic outcome (e.g., J. Frank, 1968), and even adherence to weight-loss programs (e.g., Janis, 1980). Many writers have suggested that hope is a unidimensional construct that involves an overall perception that goals can be met (e.g., Erickson, Post, & Paige, 1975; Gottschalk, 1974; Mowrer, 1960; Stotland, 1969). As Staats (1989) indicates, philosophers and practitioners have long recognized the importance of hope, but few efforts have been made to assess this construct. We will briefly look at two somewhat recent measures of hope (for other measures see Fibel & Hale, 1978; Obayuwana et al., 1982; Staats, 1989; Staats & Stassen, 1985).

The Miller Hope Scale (MHS)

Miller and Powers (1988) developed a Hope scale based on a broader conceptualization of hope than merely as an expectation for goal achievement. They saw hope as a "state of being," characterized by an anticipation for a continued good or improved state and marked by 10 critical elements such as mutuality-affiliation (i.e., hope is characterized by caring, trust, etc.) and anticipation (i.e., looking forward to a good future).

The Miller Hope Scale (MHS) is a 40-item scale that uses a 5-point response format from "strongly agree" to "strongly disagree" (5 to 1 points). Apparently, the initial item pool consisted of 47 items that were evaluated by four
judges as to the degree of "hope information" assessed by each item. Seven items were eliminated. The 40 items were then given to "six experts in measurement" who critiqued the items – but because the scale continued to be 40 items long, it is not clear whether the critiques were used to change the wording of items vs. deleting items, or for some other purpose. The MHS was administered to 75 students (primarily nursing students) as a pilot sample and to a convenience sample of 522 students from two universities, who were also administered other scales.

Reliability. For the pilot sample, Cronbach's alpha was computed to be .95, and test-retest reliability over a 2-week interval was .87. For the convenience sample, alpha was .93 and test-retest reliability, also over a 2-week period, was .82. Thus the initial reliability of this scale, viewed from either the point of internal consistency or stability over time, is quite good. We would want to round out this picture with samples other than college students and with somewhat greater time intervals.

Validity. Scores on the MHS correlated substantially (.71 and .82) with two measures of psychological well-being that presumably measured purpose and meaning in life, somewhat less (.69) with a one-item self-assessment of hope, and negatively (−.54) with a hopelessness scale. A factor analysis of the MHS suggested that a three-factor solution was the best, which the authors interpreted as satisfaction, avoidance of hope threats, and anticipation of the future.

Note a bit of a problem here. If a scale correlates substantially with other scales that presumably measure different constructs, then we can question whether in fact the scales are measuring the same construct. Here is a hope scale that correlates substantially with measures of psychological well-being; does the MHS measure hope or some other variable, such as purpose in life, psychological health, adjustment, a desire to present one's self favorably? If we could arrive at a consensus – unlikely in reality but thinkable theoretically – that all these scales are measuring hope, then we need to ask why use the MHS. What advantages does this scale have over others? Is it shorter? Does it have better norms? Is it more easily available? Does it have higher reliability?

The Snyder Hope Scale (SHS)

Snyder et al. (1991) suggested that hope is made up of two components, encapsulated by the popular saying "Where there is a will there is a way." The "will" has to do with goal-directed agency, while the "way" has to do with pathways. Both agency and pathways are necessary, but neither is sufficient to define hope.

A pool of 45 items written to reflect both the agency and pathways aspects of hope was administered to a sample of introductory psychology students. A 4-point response scale (from definitely false to definitely true) was used. Item-total correlations were computed, and 14 items with correlations larger than .20 were retained. The senior author then decided to keep the four items that most clearly reflected the agency component (e.g., "I energetically pursue my goals") and the four items that most clearly tapped the pathways component (e.g., "I can think of many ways to get out of a jam"); the final scale, then, contains these eight hope items plus four filler items.
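The retention rule just described, keep items whose item-total correlations exceed .20, can be sketched as follows. The data are simulated, and the sketch uses the corrected correlation (each item against the total of the remaining items), a common refinement; the original report does not state which variant was computed.

```python
import numpy as np

def item_total_correlations(X: np.ndarray) -> np.ndarray:
    """X is respondents x items. Return each item's correlation with the
    summed score of the remaining items (the corrected item-total r)."""
    total = X.sum(axis=1)
    rs = []
    for j in range(X.shape[1]):
        rest = total - X[:, j]          # total score with item j removed
        rs.append(np.corrcoef(X[:, j], rest)[0, 1])
    return np.array(rs)

# Simulated 45-item pool: later items are written to tap the trait more strongly
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
X = rng.normal(size=(200, 45)) + trait * np.linspace(0, 1, 45)
r = item_total_correlations(X)
print((r > .20).sum(), "of 45 items would be retained")
```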
Reliability. Cronbach's alphas for the total scale range from .74 to .84, for the agency subscale from .71 to .76, and for the pathways subscale from .63 to .80. Given that the subscales have only four items each, the results seem quite acceptable. Test-retest reliability was assessed in four samples of college students, with retest intervals ranging from 3 weeks to 10 weeks; the resulting correlation coefficients ranged from .73 to .85.

Validity. Factor analyses across various samples suggest that a two-factor solution paralleling the agency-pathways distinction makes sense; the two factors seem robust in that they account for 52% to 63% of the variance. At the same time, the two factors correlate with each other – with correlation coefficients ranging from .38 to .57. A series of studies assessed convergent validity. Scores on the SHS were positively correlated with measures of optimism, perceived control, and self-esteem, and negatively correlated with measures of hopelessness, depression, and psychopathology, as assessed by the MMPI. A number of other, more complicated studies are presented that address the construct validity of the scale (Snyder et al., 1991). For example, predictions
were made that high scorers would maintain their hopefulness in the face of stress ("when the going gets tough, the tough get going"), while low scorers would deteriorate in their hopefulness. Predictions were also made that higher-hope individuals would have a greater number of goals across various life aspects and would select more difficult goals. These and other hypotheses were generally supported. There were no gender differences obtained on this scale, despite the expectation by the authors that there would be.

HASSLES

Life is full of hassles, and how we handle these seems to be an important variable. Some people are able to follow the old adage, "When life gives you lemons, make lemonade."

The Hassles Scale

The Hassles Scale (Kanner, Coyne, Schaefer, et al., 1981) consists of 117 minor stressors that occur on a frequent basis. Hassles can involve minor irritations, such as losing things, or bigger problems, like getting a traffic citation. Respondents to the Hassles Scale are asked to note the occurrence and rate the severity of these daily life hassles on a 4-point scale ranging from none to extremely severe, within a time frame of the "last month." The items are fairly general (e.g., home maintenance; health of a family member) rather than specific (e.g., the roof is leaking and I had to call the apartment manager three times). The Hassles Scale was revised in 1988 (DeLongis, Folkman, & Lazarus, 1988).

Some studies have found that daily hassles are more predictive of self-reported adjustment difficulties than other measures (e.g., DeLongis, Coyne, Dakof, et al., 1982; Wolf, Elston, & Kissling, 1989).

Development of a college form. The Hassles Scale, although used in a number of studies with college students, contains a number of items that do not apply to many college students (e.g., job security, problems with spouse), and so Blankstein, Flett, and Koledin (1991) developed a form for use with college students (for hassles scales developed for particular age groups such as adolescents or the elderly, see Compas, Davis, Forsythe, et al., 1987; Ewedemi & Linn, 1987; C. K. Holahan & C. J. Holahan, 1987; Kanner, Feldman, Weinberger, et al., 1987). To compile an item pool, the authors asked more than 200 undergraduate students at a Canadian university to list one or two hassles they had recently experienced in nine different areas, such as school, work, family, and finances. The obtained items were classified into the nine areas, tabulated as to frequency, and finally 20 items were selected to represent the nine areas. A new sample of more than 400 students completed this Brief College Student Hassles Scale (BCSHS) along with other measures. The instructions asked the subject to check each item on a 1- to 7-point scale, ranging from no hassle to extremely persistent hassle, and to use the preceding 1 month as a time frame.

Reliability and validity. Reliability was assessed by the alpha coefficient, calculated to be .81 – quite high considering that the items theoretically represent diverse areas. Some minor gender differences were obtained, with women rating the items "academic deadlines," "weight," and "household chores" higher than men. Scores on the BCSHS were negatively correlated with a measure of optimism (greater optimism associated with less persistent hassles), and positively correlated with negative affect (more persistent hassles associated with, e.g., anxiety, loneliness, and unhappiness).
disagreement. One is the distinction between loneliness and other terms such as aloneness, solitude, and seclusion. Another is whether loneliness is a state or a trait, or both. Still a third is the difference, if any, between transitory and chronic loneliness.

The UCLA Loneliness Scale (ULS)

One of the most frequently used and psychometrically sound instruments for the measurement of loneliness is the UCLA Loneliness Scale, originally presented in 1978 (Russell, Peplau, & Ferguson, 1978) and revised in 1980 (Russell, Peplau, & Cutrona, 1980).

The ULS was initially developed by using 25 items from another, longer (75-item) loneliness scale. This pool of items was administered to a sample of UCLA students who were either members of a discussion group on loneliness (n = 12), a Social Psychology class (n = 35), or in Introductory Psychology (n = 192). Correlations between item and total score indicated that 5 of the items correlated below .50, and hence these were dropped. The final scale of 20 items was then assessed for reliability, and a coefficient alpha of .96 was obtained. A sample of 102 students retested over a 2-month period yielded a test-retest r of .73. Scores on the ULS correlated .79 with self-ratings of loneliness, and the mean was significantly different for those students who had volunteered for the discussion group on loneliness than for a control sample. Furthermore, scores correlated significantly with self-ratings of depression, anxiety, and endorsement of several adjectives such as "empty" and "shy," and negatively with self-ratings of satisfaction and of happiness.
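Coefficient alpha, reported above as .96, is computed from the item variances and the variance of the total score. A minimal implementation follows, with simulated data standing in for real protocols:

```python
import numpy as np

def cronbach_alpha(X: np.ndarray) -> float:
    """X is respondents x items.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = X.shape[1]
    item_var = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Simulated 20 items on the ULS-style 1-4 (never...often) response scale
rng = np.random.default_rng(1)
lonely = rng.normal(size=(300, 1))                # a shared latent trait
X = np.clip(np.round(2.5 + lonely + rng.normal(0, .8, size=(300, 20))), 1, 4)
print(round(cronbach_alpha(X), 2))                # high, as all items tap the trait
```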
The ULS is thus a unidimensional measure composed of 20 items, 10 worded positively and 10 worded negatively. Subjects respond on a 4-point scale of never (1), rarely (2), sometimes (3), and often (4). Coefficient alphas are typically in the .80s range for early adolescents (Mahon & Yarcheski, 1990), and in the .90s for college students and adults (Hartshorne, 1993; Knight, Chisholm, Marsh, et al., 1988). The results of factor-analytic studies are, however, mixed. Some researchers have found one factor (e.g., Hartshorne, 1993), others two factors (e.g., Knight, Chisholm, Marsh, et al., 1988; Zakahi & Duran, 1982), and some, four or five factors (e.g., Hays & DiMatteo, 1987; Hojat, 1982). Part of the problem with the ULS is that at least some of the items are worded in a confusing or ambiguous manner (Hartshorne, 1993). Hartshorne (1993) suggests that loneliness as measured by the ULS is a bimodal emotional state; that is, at the time of testing individuals either are or are not lonely.

The ULS has been used in a number of cross-cultural studies. For example, in Japan the ULS was reported to have high internal consistency (alpha of .87, split-half of .83) and reasonable test-retest correlation over a 6-month period (r = .55). Lonely Japanese reported more limited social activities and relations and tended to regard their parents as being disagreeable, cold, and untrustworthy; they had lower self-esteem and experienced more medical and psychological problems (Kudoh & Nishikawa, 1983). (For another loneliness scale, see N. Schmidt & Sermat, 1983; see also the Journal of Social Behavior and Personality, 1987, 2, No. 2; the entire issue is devoted to loneliness.)

DEATH ANXIETY

Lives take many forms, but we all share in the reality of our eventual death. Some people are made extremely anxious by this thought, while others are relatively unconcerned, and still others look forward to death as a transition. Thus, death anxiety is an important variable, and a number of investigators have developed scales to assess this concept. Probably the best known of these scales is the Death Anxiety Scale by Templer (1970).

Death Anxiety Scale (DAS)

Templer (1970) wrote that up to that time, three methods of assessment had been used in the measurement of death anxiety: interviews, projective techniques, and questionnaires, but that with the exception of one questionnaire, the reliability and validity of these procedures had not been reported in the literature.

Templer (1970) developed the DAS by devising 40 items on a "rational" basis – that is, because this was his doctoral dissertation, he presumably read the pertinent literature, talked to his committee members, and finally wrote down these items. Four chaplains in a state mental hospital, two graduate students, and one clinical
psychologist rated the face validity of each item; 31 items survived. These items were then embedded in a set of 200 MMPI items to disguise their intent and were administered to three groups of college students. Those items that correlated significantly with total scores in at least two of the three samples were retained. This yielded a scale of 15 items.

The DAS is reported to have adequate test-retest reliability (.83 for a sample of college students retested after 3 weeks) and internal consistency (K-R 20 coefficient of .76). Templer (1970) presents a series of small pilot studies. Two of the studies suggest that the DAS is free of nuisance components such as acquiescence and social desirability. Two other studies address the validity of the DAS. In one, psychiatric patients who had spontaneously verbalized fear of death were compared with a control sample: the respective DAS means were 11.62 and 6.77, a statistically significant difference. In another study with college students, scores on the DAS were found to correlate with MMPI measures of general anxiety, with another death-anxiety questionnaire, and with an experimental task designed to assess "number of emotional words."

A number of investigators have used the DAS with varying results. For example, Donovan (1993) translated the DAS into Brazilian Portuguese. This was done by what is called the back-translation method (Brislin, 1970), where an English-language instrument is translated into another language, the other-language version is retranslated into English and then compared with the original English version. Brislin (1970) also suggests that to check the adequacy of a translation, one could administer the two forms to a bilingual sample. This was done by Donovan (1993) in a modified manner, where some Brazilian students took the odd-numbered items in English and the even-numbered items in Portuguese (form A), and some students took the odd-numbered items in Portuguese and the even-numbered items in English (form B). For form A, the split-half reliability was calculated to be .59, while for form B it was .91. The convergent and discriminant validation of this Portuguese version was carried out by comparing the DAS with various other measures such as the STAI (also in Portuguese). The results were quite similar to those obtained in American samples.

SUMMARY

This chapter looked at a selected number of topical areas to illustrate various uses of tests and some basic issues. In the area of self-concept, the Tennessee Self-Concept Scale illustrates some of the challenges that need to be met. In the area of locus of control, the Rotter scale occupies a central position and gave birth to a multitude of such scales. Sexuality is yet another active field of research where various scales and psychometric methodologies can provide potentially useful results. In the area of creativity, the challenges are many, and the measurement there, if not still in its infancy, is certainly in its childhood. Other areas such as hope and death anxiety illustrate a variety of approaches and issues.

SUGGESTED READINGS

Barron, F. (1958). The psychology of imagination. Scientific American, 199, 151–166.
A fascinating look at the study of creativity as it was undertaken at the Institute of Personality Research and Assessment of the University of California at Berkeley. Barron briefly illustrates a number of techniques used in these assessment studies, including the Welsh Figure Preference Test, a Drawing Completion test, and an inkblot test.

Mednick, S. A. (1962). The associative basis of the creative process. Psychological Review, 69, 220–232.
For the area of creativity, this is somewhat of a classic paper in which the author proposes an associative interpretation of the process of creative thinking and presents a test of creativity – the Remote Associates Test – based on the theoretical approach.

O'Donohue, W., & Caselles, C. E. (1993). Homophobia: Conceptual, definitional, and value issues. Journal of Psychopathology and Behavioral Assessment, 15, 177–195.
This paper examines the concept of homophobia – negative attitudes, beliefs, or actions toward homosexuals – and in particular looks at a number of scales that have been developed to assess this variable.

Rotter, J. B. (1990). Internal versus external control of reinforcement. A case history of a variable. American Psychologist, 45, 489–493.
The author considers why locus of control has become such a popular variable. He identifies four aspects as important: (1) a precise definition of locus of control; (2) the imbedding of the construct in a broader theory; (3) the development of a measure from a broad psychological perspective; and (4) the presentation of the initial scale in a monograph
format, rather than a brief journal article, allowing for sufficient detail and data presented from a programmatic effort. Rotter makes a plea for the importance of broad theories in psychology.

Snyder, C. R. (1996). Development and validation of the State Hope Scale. Journal of Personality and Social Psychology, 70, 321–335.
A report of four studies to develop and validate a measure of state hope.

DISCUSSION QUESTIONS

1. If you were writing this chapter on psychological tests that measure normal positive functioning, what might be some of the topic areas you would cover and why?
2. What else could have been done to generate primary, secondary, and tertiary validity data for the Self-Esteem Questionnaire?
3. What are some of the aspects of sexuality that might be assessed through psychological tests?
4. Assuming that you have a modest research budget, how would you develop a creativity scale on the ACL?
5. How would you show that the Chinese Tangrams measures creativity rather than, say, spatial intelligence?
PART THREE: APPLICATIONS OF TESTING

9 Special Children

AIM This chapter looks at psychological testing in the context of “special” children.
We first look at some specific issues that range from the laws that have impacted
the psychological testing of handicapped children, to issues of infant intelligence. In
the process, we look at several instruments that are broad-based and nicely illustrate
some of these issues. Then we look at nine major categories of special children, not
exhaustively, but to illustrate various issues and instruments. Finally, we return to some
general issues of this rather broad and complicated area.

SOME ISSUES REGARDING TESTING

Special children. Although all children are special, we will use the term "special children" as used in the literature, namely to signify children who have some condition that presents, at least potentially, difficulties in their development and in their learning so that they do not adapt or function at what may be considered the "normal" level.

Need to identify and assess. Recently, there has been a substantial increase in the need to identify and assess such children so that they may be given appropriate assistance. This increase is due to several factors. One is the advance of medical sciences. Children who years ago would have died at birth have now increased survival rates, but often the result is a child with disabilities. A second aspect is the passage in 1975 of Public Law 94–142, the Education for All Handicapped Children Act, which mandates a wide range of services for such children, as well as their right to an education. Finally, parents of special children have become much more vocal in their demands for comprehensive services for their children.

Public Law 94–142. In 1975, Public Law 94–142, or the Education for All Handicapped Children Act, was passed by Congress. This law mandated that a "free and appropriate" education be provided to all handicapped children. This necessitated the identification of such children that might be eligible for special educational services. The law also included, for the first time, federal standards for the use of intelligence tests. It specified that test results should be viewed as only one source of information. Furthermore, the law specified that a test should be administered in the child's native language, or at the very least, that linguistic differences should not interfere with the assessment of intelligence. The law, however, did not indicate specifically how screening and testing were to be conducted, with the result that each educational entity uses different procedures. As W. L. Goodwin and Driscoll (1980) pointed out, the screening and diagnostic phases have not been kept distinct. Screening requires tests that are valid, but brief and economical. Diagnosis involves in-depth measurement and examination by a team of professionals.

Public Law 99–457. This law is an extension of Public Law 94–142, but it applies to preschool
children and infants. Thus, education and school services are mandated to handicapped children aged 3 to 5. In addition, the law created a new program for children from birth to 2 years; children are eligible for the program if they are developmentally delayed or if they are at risk for either developmental, physical, or emotional delays. Both laws have created an impetus for the testing and assessment of these children, as well as for the development of screening instruments. Quite a few instruments are now available, such as the Denver Developmental Screening Test (Frankenburg & Dodds, 1967), the Developmental Indicators for the Assessment of Learning (Mardell & Goldenberg, 1975), the Comprehensive Identification Process (Zehrbach, 1975), the Minnesota Preschool Inventory (Ireton & Thwing, 1979), and the Preschool Screening System (P. K. Hainsworth & M. L. Hainsworth, 1980).

Challenges in evaluation. In testing special children, three approaches are typically used. One involves the adaptation or modification of existing instruments. Thus, a test such as the Stanford-Binet or WISC-R may be changed in terms of directions, specific items, time limits, or other aspects relevant to the type of handicap present. Clearly, this can invalidate the test or, at the very least, restrict the conclusions that can be made. Such modifications need to be fully reported when the test findings are used in some type of report, like one a school psychologist might prepare. A second approach involves the use of instruments designed specifically for that type of special child. For example, there are tests that have been developed specifically for use with the visually impaired or with the hearing impaired. A third approach involves combining a variety of methods and instruments. The problems with this approach are often practical – insufficient time available to the examiner, fatigue for both examiner and subject, and so on.

In evaluating testing of special children, a number of issues must be kept in mind. First, diagnostic criteria are often not well formulated, nor is there agreement on what differences, if any, exist among such terms as handicap, disability, defect, and others (e.g., Justen & G. Brown, 1977; Mitchell, 1973).

Second, the assessment of the child is intimately related to the procedures used and the capabilities of the examiner. In the assessment of severity level (e.g., how severe is this child's hearing impairment), there are two basic approaches taken. One is the use of a standardized test, where the judgment of severity is made on the basis of how deviant a child's score is from the mean – for example, an IQ two standard deviations below the mean is taken as possible evidence of mental retardation. A second approach is more of a global or impressionistic procedure in which a professional judgment is made as to the level of severity present. The norms here are often more nebulous and may in large part be a function of the knowledge and experience of the particular clinician. One important aspect of a well-standardized psychological test is that the test provides a common yardstick that supplies every user with an equal amount of experience – i.e., normative data.
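To make the first approach concrete with purely hypothetical figures: on an intelligence test scaled to a mean of 100 and a standard deviation of 15, two standard deviations below the mean is 100 − (2 × 15) = 70, so a child scoring at or below 70 might be identified for further evaluation; on a test scaled with a standard deviation of 16, the corresponding value would be 68.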
A third aspect to consider is the particular conditions for which the child is being tested. Motor and visual impairments, for example, may well interfere with a child's test performance and mask his or her capabilities. Medications that the child may be taking to control seizures or other medical conditions may make the child lethargic or may cause side effects that can alter test performance. Children who have multiple handicaps, such as both mental retardation and visual impairment, present special testing challenges (e.g., Ellis, 1978).

Functions of assessment. In this connection, DuBose (1981) indicates that before selecting tests for an assessment battery, the examiner must determine why the child is being assessed. There are two main separate functions of assessment of special children: identification-placement and intervention-programming. The first purpose requires standardized and/or criterion-referenced tests acceptable to the local agencies, schools, etc., that determine eligibility for services and placement. If the testing is being done for intervention and programming purposes, then the tests that are selected need to address the concepts and behaviors to be targeted during the intervention. Most likely, these tests will mirror the curriculum and will be criterion-referenced.

As we have seen, there is probably no test that is free of limitations and criticisms. The utility of a number of tests is severely limited by
questions of technical adequacy in their development and in their reliability and validity. Tests that may be quite useful in research settings or with normal children may be of limited use with special children. Even when tests are appropriate, the information they yield may be of limited value to the development of a treatment plan or to the prescription of special educational interventions. Sometimes two instruments that supposedly measure the same functions may yield rather disparate results (e.g., Eippert & Azen, 1978; B. L. Phillips, Pasewark, & Tindall, 1978). Sometimes the same instrument, when revised, can yield substantially different normative interpretations of the same score (e.g., Covin, 1977; R. J. Thompson, 1977).

In Chapter 2, we discussed the notions of age norms and school-grade norms. We noted several problems with these norms, which are particularly acute in the testing of special children. These children are often developmentally delayed. Thus a mentally retarded 8-year-old child may have language abilities that are more appropriate for a normal 3-year-old. But we should not therefore conclude that the child functions at a 3-year-old level. He or she probably functions at different levels in different domains, and most likely his or her language abilities may show more variability of functioning than that of the typical 3-year-old. The challenges to the appropriate use of tests with special children are many, but this is not to say that these tests are useless – far from it. Again, we make the point here that psychometrically valid instruments, in the hands of a well-trained professional, often represent the only objective data available, and they can be extremely useful for a variety of purposes.

Domains of assessment. There are all sorts of tests that can and have been administered to special children, and there are many ways to categorize these tests and/or the underlying domains. One way is to consider the following five categories: (1) infant scales, used to diagnose early developmental delays that may occur in cognitive, physical, self-help, language, or psychosocial development; (2) preschool tests, used to diagnose mental retardation and learning disabilities and to assess school readiness; (3) school-age tests of intelligence and cognitive abilities; (4) school-age tests of academic achievement; and (5) school-age measures of "personality" or affective behavior (Gallagher, 1989).

Simeonsson, Bailey, Huntington et al. (1986) suggest that there are four domains of particular cogency to the testing of special children: cognition, communication, personal-social, and behavior. Although these categories are neither clear-cut nor mutually exclusive, they can serve as a general guide. Under the area of cognition, we would find many of the intelligence tests discussed in Chapter 5, such as the Wechsler tests or the Kaufman ABC. Under the label of communication, we find a number of tests such as the Peabody Picture Vocabulary, to be discussed below. The area of personal-social includes scales of temperament, self-concept scales, locus-of-control scales, personality inventories for children, adaptive-behavior scales, and many others. Finally, under the label of behavior we would typically find checklists and rating scales to be completed by parents or teachers, which assess overt behavior as distinguished from the traits and characteristics under personal-social; a good example is the Child Behavior Checklist (Achenbach & Edelbrock, 1981), discussed next.

Other authors take different approaches. Lichtenstein and Ireton (1984), for example, indicate that there are three major categories of information to be looked at in the testing of young children: physical and sensory functioning, environmental influences, and developmental functioning. In the area of physical and sensory functioning, children who have such conditions as cerebral palsy or vision loss are typically first identified by health care professionals such as school nurses or pediatricians. It should not be assumed, however, that simply because a child is identified at high risk due to some medical condition, there will necessarily be later developmental problems. R. Lichtenstein and Ireton (1984) argue that health-related screening tests to be included in a typical preschool screening program are justified only if: (1) they relate to conditions that are relatively common in the screening population; (2) the condition allows sufficient lead time between onset and manifestation to make early identification worthwhile; (3) the conditions are relevant to school functioning; and (4) the results have implications for available treatment. Two conditions that would meet these criteria are vision and hearing impairments.
Environmental influences primarily cover the home. A number of studies have documented the role of various aspects of the home environment on subsequent behavior, both in a negative manner (e.g., Lytton, Watts, & Dunn, 1986; Richman, Stevenson, & Graham, 1982; E. E. Werner, Bierman, & French, 1971) and in a positive manner (e.g., G. Domino, 1979; Dewing, 1970).

Developmental functioning involves a systematic look at the behavior and progress of a young child in comparison to developmental norms. Such developmental functioning involves a wide variety of domains. Lichtenstein and Ireton (1984) identify nine such domains:

1. Cognitive – e.g., intelligence, reasoning, memory
2. Language – e.g., receptive and expressive
3. Speech/articulation – e.g., quality of voice, stuttering
4. Fine motor – e.g., visual-spatial abilities
5. Gross motor – e.g., hopping, skipping, running
6. Self-help – e.g., adaptive behaviors such as dressing
7. Social-emotional – e.g., temper tantrums, passivity
8. Perceptual and integrative processing – e.g., learning left from right
9. School readiness – e.g., those skills and behaviors needed for school

Obviously, the above categories are not mutually exclusive and different taxonomies can be easily developed.

Self-concept. Although the area of personal-social assessment (as opposed to cognitive) would seem to be of particular importance with special children, it is one that is relatively neglected, with most testing efforts focusing on intellectual capabilities and development. Many authors recommend assessment of self-concept, with many studies suggesting that special children are low on self-esteem, which results in poor adjustment, lack of motivation, a negative self-image, a low tolerance for frustration, and an unwillingness to take risks. There are of course many well-known measures of self-concept such as the Tennessee Self-Concept Scale (Fitts, 1965), discussed in Chapter 8, the Self-Esteem Inventory (Coopersmith, 1967), the Piers-Harris Self-Concept Scale (Piers & Harris, 1969), and the Self-Perception Inventory (A. T. Soares & L. M. Soares, 1975). The majority of self-concept scales are self-report inventories – the child responds to stimuli by choosing the one that best represents himself or herself. These stimuli may be written statements, statements that are read by the examiner, drawings of children, "happy" faces, etc. (see Coller, 1971; Walker, 1973).

Age differentiation. From birth to 18 months is traditionally considered the infancy stage, while from 18 months to about 5 years are the preschool years. Tests designed for infants and for preschool children usually involve observation, performance items, and/or oral administration rather than the "paper-and-pencil" approach, and need to be administered individually.

Comparability of test scores. When we administer a test to a special child, how do we know that the test is indeed valid for this purpose, given that a test's validity may be based upon "normal" children, and given that the test may have been modified in some way for administration to special children? Willingham (1989) suggests that for a test to be valid, i.e., fair, with handicapped individuals, the scores on the test must be "comparable" between normal and handicapped, along eight dimensions:

1. Comparable factor structure. If a test is modified in some way for administration to special children, it is possible that the test may no longer measure the same construct. A factor analysis of the modified test should yield relevant information as to whether the structure of the test has changed.
2. Comparable item functioning. Even if the overall factor structure is the same for the original test and the modified or nonstandard test, it is possible that specific items may be more (or less) difficult for someone with a particular impairment.
3. Comparable reliability. Obviously, the modified form needs to yield consistent results – i.e., be reliable.
4. Comparable accuracy of prediction. Does the nonstandard test discriminate between successful
and unsuccessful students? Does the test correctly predict academic performance?
5. Comparable admissions decisions. Does the very use of a nonstandard form bias the decision made?
6. Comparable test content. Do the standard and nonstandard test forms have "comparable" content?
7. Are the testing accommodations comparable? Obviously, handicapped individuals may have special requirements such as braille forms, need for a reader, and so on. But insofar as possible, is the situation comparable – i.e., allowing the subject to show what he or she can do?
8. Comparable test timing. Even a power test may have a time limit, if nothing else, imposed by the testing situation (e.g., it is almost lunch time). Nonstandard tests often use very different time limits that in effect may not be comparable.

Although these comparabilities were presented by Willingham (1989) in the context of college-admissions testing, they are equally applicable to tests used for special children and quite clearly represent ideals to strive for.

Issues of measurement of young children. Goodwin and Driscoll (1980) outline a number of issues that concern the measurement and evaluation of young children. The first issue is whether young children should be tested. Some people express fear that the very procedure of testing a young child can be harmful and traumatic. There does not seem to be much evidence to support this, and most professionals believe that the potential benefits of professionally administered tests far outweigh any possible negative effects. Often, fears are raised by misinformed individuals who imagine the wholesale administration of multiple-choice exams to preschoolers; they do not realize that most tests at this level are individually administered or require an adult informant. Fleege, Charlesworth, Burts et al. (1992) did report that kindergarten children being given a standardized achievement test exhibited more behaviors that were stress related during the test than before or after the test – but much of the behavior seemed to be related to how the adult authority figures around the children reacted.

A second issue is whether measurement is possible or meaningful with young children. In general, the younger the child, the greater the measurement problems; in addition, the measurement of "affective" constructs seems to be more difficult than the measurement of cognition.

A third issue is whether tests are used appropriately with young children. There is of course a growing dissatisfaction, both by professionals and lay persons, with the use of tests in American schools, but satisfactory substitutes have yet to appear.

Another issue is whether tests used with young children are fair, particularly with regard to ethnic minorities. This is a highly emotional issue, where often facts are distorted to fit preconceived opinions. The issue is by no means an open-and-shut one, and there is much controversy. We take a closer look at this in Chapter 11.

Perhaps of greater concern than the above points is the possible negative consequence of labeling a child with a particular diagnostic label (C. D. Mercer, Algozzine, & Trifiletti, 1979). This also is a highly emotional issue, with little data available. From a psychometric point of view, to test a child merely to obtain a diagnostic label is like purchasing a house to have a mailbox. Although the mailbox may be "central" and may serve a valuable purpose, it is the house that is important.

Infant intelligence. The assessment of infant intelligence is a difficult task. The examiner must not only establish rapport with both the infant and the adult caretaker, but must be a superb observer of behavior that does not appear on command. With older children we can offer them rewards, perhaps reason with them, or use our adult status to obtain some kind of compliance; infants, however, do not necessarily respond in a cooperative manner.

Infant intelligence tests are often assessed from the viewpoint of content validity; that is, do they reflect a sample of behavior that is governed by intellect? They can also be judged by predictive validity, that is, how well they forecast a future criterion. This is a difficult endeavor because the intelligence of an infant is primarily exhibited through motor behavior, while the intelligence of a school-aged child is primarily exhibited through verbal school achievement. Finally, tests
of infant intelligence can be judged from the more global aspect of construct validity – i.e., we can perceive the behavior of an infant as a stage that is part of a maturational or developmental sequence.

Most tests of preschool intellectual functioning use similar items, and most of these items go back to the work of Gesell and the Gesell Developmental Schedules (Gesell, Halveron, & Amatruda, 1940). This is somewhat peculiar because Gesell did not believe that intelligence could be assessed but was more interested in measuring the general maturation or developmental level of the child.

We take a brief look at the Gesell Developmental Schedules, followed by a good example of a scale for infant intelligence, namely the Bayley Scales of Infant Development.

The Gesell Developmental Schedules. As the title indicates, these are developmental schedules or frameworks, that is, they are not tests in the strict sense of the word, but a timetable of what is to be expected of a child in five areas of behavior: adaptive behavior, gross motor movement, fine motor movement, language, and personal-social behavior. The schedules provide a standardized means by which a child's behavior can be observed and evaluated; they are particularly useful to pediatricians and child psychologists who need to observe and evaluate young children.

The scales were developed by Gesell (Gesell, Halveron, & Amatruda, 1940) and revised a number of times by various investigators (Ames, Gillespie, Haines, et al., 1979; Gesell, Ilg, & Ames, 1974; Knobloch & Pasamanick, 1974; Knobloch, Stevens, & Malone, 1980). The scales cover an age range of 4 weeks to 5 years and are basically a quantification of the qualitative observations that a pediatrician makes when evaluating an infant. The scales are used to assess developmental status, either from a normal normative approach, or to identify developmental abnormalities that might be the result of neurological or other damage. In fact, part of the value of these schedules is their focus on neuromotor status, which covers such things as posture, locomotion, and muscle tone, and can provide valuable quantitative information for the identification of neuromotor disabilities.

The items in the schedules are organized in terms of three maturity zones, namely supine, sitting, and locomotion, which serve as starting points for the examination. Developmental quotients are calculated for each of the five areas of behavior. These quotients parallel the old ratio IQs and use the formula of maturity age/chronological age × 100.
Because these scales involve an observer, namely the examiner, we might ask about interrater reliability. The evidence suggests that with appropriately trained examiners, such reliability can exceed .95 (Knobloch & Pasamanick, 1974).

A number of other scales for infant assessment have been developed; among these might be mentioned the Griffiths Scales of Development (Griffiths, 1970) and the Vulpe Assessment Battery (Vulpe, 1982).

The Bayley Scales of Infant Development. One of the best known tests of infant development is the Bayley (Bayley, 1969). The Bayley is composed of three scales: a mental scale, which assesses such functions as problem solving and memory; a motor scale, which assesses both gross-motor abilities such as walking and finer motor skills such as finger movement; and an Infant Behavior Record (IBR), which is a rating scale completed by the examiner at the end of testing and designed to assess "personality" development in 11 areas such as social behavior, persistence, attention span, cooperation, fearfulness, and degree of activity. The scales cover the ages of 2 months to 30 months. The mental scale consists of 163 items arranged chronologically beginning at a 2-month level and ending at the 30-month level. The nature of the items changes with progressive age. At the earliest levels, the items assess such aspects as visual tracking and auditory localization. Later items assess such functions as the purposeful manipulation of objects, early language development, memory, and visual discrimination. The motor scale consists of 81 items, also arranged chronologically, that assess both fine- and gross-motor abilities, such as grasping objects and crawling. The test stimuli for the mental and motor scales include a variety of objects such as a mirror, a ball, a toy car, a rattle, crayons, and cups.

The Infant Behavior Record consists of 30 items, most of which are rated on a 5- or
9-point Likert-type response scale. The items assess such variables as the infant's responsiveness to the examiner, the degree of happiness and fearfulness, attention span, gross bodily movement, degree of excitability to the environment, sensory activities such as exploration with hands, and muscle-movement coordination.

The Bayley was revised in 1993 and is now known as the Bayley II. The revision included new normative data based on a representative sample of 1,700 children, an extended age range from 1 month to 42 months, and some item changes, but the nature of the test has not changed. For a comparison of the original and revised versions see Gagnon and Nagle (2000).

Administration. The Bayley requires a well-trained examiner, who is not only familiar with the scales themselves, but who is comfortable with infants, and has a good knowledge of what is and is not a normal pattern of development. Although the items are arranged chronologically, test items that have similar content, as for example involving wooden cubes, are administered consecutively. As with the Stanford-Binet and other tests, there is a basal level and a ceiling level. The basal level is defined as 10 consecutive items that are passed; the child then gets credit for earlier items even though they were not administered. A ceiling level is defined as 10 consecutive items that are not passed. Pass-fail criteria for each item are stated clearly in the test manual, but there is still a considerable amount of subjective judgment needed. The examiner usually begins by administering items 1 month below the child's chronological age, unless there are indications a lower level might be more appropriate. As with other major tests, ancillary materials are available for the examiner, such as a supplementary manual (Rhodes, Bayley, & Yow, 1984).

For testing to be successful, there has to be good rapport between child and examiner. The child needs to be interested in the testing and cooperative so that the testing can be completed.
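A hypothetical run-through may make the basal and ceiling rules concrete: if a child passes items 40 through 49 consecutively, a basal has been established, and items 1 through 39 are credited as passed even though they were never administered; if the child later fails items 70 through 79 consecutively, a ceiling has been reached and the testing can be discontinued.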
Scoring. Scores on the mental and motor scales are expressed as normalized standard scores with a mean of 100 and a SD of 16, similar to the Stanford-Binet. These are called the Mental Development Index and the Psychomotor Development Index and can vary from 50 to 150. For the IBR, there is no total score. Matheny (1980) developed five subscales on the IBR based on factor analysis. He called these task orientation, test affect (introversion-extraversion), activity, audiovisual awareness, and motor skill. The Bayley has been quite useful in a variety of areas, including the early detection of sensory and neurological impairments. As the author herself has pointed out, the scales should be used to assess current developmental status and not to predict future ability level (Bayley, 1969). The Bayley can, however, provide a baseline against which later evaluations can be compared to assess developmental progress in both mental and motor areas.

Reuter, Stancin, and Craig (1981) developed a scoring adaptation for the Bayley that leads to developmental-age scores for five domains: cognitive, language, social, fine motor, and gross motor. This approach is particularly useful when assessing special children such as the mentally retarded or those with developmental-language disorders.

Reliability. In general, the reliability of the Bayley is acceptable and fairly consistent throughout the age periods covered. But reliabilities for the motor scale tend to be lower for the first 4 months. Interrater reliability ranges from 67% to 100% agreement by two separate raters, with most of the items showing interrater correlations greater than .60 (Matheny, 1980). Split-half reliabilities for the mental scale range from .81 to .93 with a median of about .88; for the motor scale they range from .68 to .92 with a median of .84.

Validity. The correlations between the mental and motor scales vary considerably and tend to decrease with age; that is, motor and mental development are different and should be assessed separately in young children. Early developmental functioning has little predictive validity in terms of later intelligence, except for those children who are clearly developmentally deficient. The concurrent validity of the Bayley seems good, with strong positive correlations between the Bayley and the Stanford-Binet.

There really is not much validity information available on the Infant Behavior Record, in part because the behavior measured is rather narrow in its specificity to a situation, i.e., the test situation. Thus for example, items on the IBR do
not correlate highly with items on another infant inventory as filled out by the mothers (Burg, Quinn, & Rapoport, 1978). In addition, there is no total score on the IBR, and single item scores are considered too unreliable.

Norms. The 1969 scale norms were based on 1,262 children, approximately equally distributed between the ages of 2 months and 30 months. The standardization sample was selected to be representative of the U.S. population on a variety of demographic aspects, such as geographic region of residence, gender, race, and educational level of the head of the household. The norms are presented separately for different age groups – from 2 to 6 months in 1/2-month intervals, and from 6 to 30 months in 1-month intervals. Because children can show dramatic increases in abilities within a short time span, such specific norms are both needed and quite valuable. These norms, however, do not include children who were institutionalized or born prematurely.

Social-emotional behavior. Much of the assessment of infants and young children has, from a psychometric point of view, focused on cognitive development as shown by motoric behavior and coordination. The assessment of noncognitive aspects that might be labeled "social skills," "personality," "social-emotional behavior," or similar labels has seen substantially less emphasis. There are numerous reasons for this. One is, as we've learned before, that tests are usually developed in response to some need. With infants, there has been a need to identify those children with possible developmental difficulties such as mental retardation; with school children, the emphasis has been on behaviors related to learning problems, discipline, hyperactivity, and so on. Another reason was the active role of behaviorism in American psychology, a movement in which the emphasis was on behavior that could be observed rather than such theoretical concepts as personality. Psychoanalysis also has played a role. Although its emphasis has been on early childhood and on the notion that a person's personality is well established in the first 5 years of life, its focus was not on psychometrics.

It is only recently, beginning in the 1960s and 1970s, that there has been an increased focus on special-education programs in schools, with an increased emphasis on the identification of children who either have or are at risk for developmental handicaps. And only very recently have researchers begun to explore, in a systematic and longitudinal manner, the relationship between infant behavior and subsequent manifestations (e.g., A. Thomas & Chess, 1977).

Why assess preschool children in terms of social-emotional behavior? Martin (1988) suggests four reasons:

1. As an outcome measure, that is, to determine the effects of a particular set of conditions, such as stress, a congenital disease, or child abuse, on the infant;
2. As a concurrent measure, to describe the current status of a child objectively, so that the information can be used by the parents. For example, although most children go through a phase commonly called "the terrible twos," there may be need to assess a particular child to determine whether their oppositional behavior is normal or symptomatic of a more serious condition;
3. As a predictive measure, to identify children who are at risk for some condition, so that therapeutic and/or preventive procedures can be initiated;
4. As a research tool, to relate infant characteristics to other aspects, such as future interactions in the classroom.

This area of testing is of course very broad and encompasses a wide range of variables, many that might be called "temperament," such as mood, degree of persistence, distractibility, activity level, and so on. Assessment of these is typically done through observation by the examiner, interview of the child's caregiver, or completion of a questionnaire by an adult familiar with the child. The Infant Temperament Questionnaire (Carey & McDevitt, 1978) and the Colorado Childhood Temperament Inventory (Rowe & Plomin, 1977) are examples of such measures. Unfortunately, most of these measures have been criticized for not meeting basic psychometric standards (e.g., Hubert, Wachs, Peters-Martin, et al., 1982).

Other scales assess social competence, self-help skills, or other related aspects. Indeed these may well be considered part of the more general label
of adaptive behavior. The Cain-Levine Social Competency Scale (Cain, Levine, & Elzey, 1977) is a good example of such a scale. Others that are mentioned in the literature with some frequency are the Brazelton Behavioral Assessment Scale (Brazelton, 1973), the Rothbart Infant Behavior Questionnaire (Rothbart, 1981), and the Fullard Toddler Temperament Scale (Fullard, McDevitt, & Carey, 1984). The Brazelton, in particular, has been used in a wide variety of studies, including the assessment of cross-cultural differences in neonates (e.g., Brazelton, Robey, & Collier, 1969; D. G. Freedman & N. Freedman, 1969). However, none of these scales come close in popularity and psychometric sophistication to the Bayley Scales of Infant Development, discussed earlier.

There are hundreds of such measures available and, although there are individual exceptions, most are unpublished, lack basic validity data, and in particular are weak as to their construct validity. For a review of 10 such scales, see Bracken (1987). One scale that has found a certain degree of popularity is the Personality Inventory for Children (Wirt, Lachar, Klinedinst, et al., 1990).

The Personality Inventory for Children (PIC). At first glance, the PIC should be described as a behavior-rating scale. It is, however, listed as a personality test in such places as the Mental Measurements Yearbook because of its rating format (true-false) and scale-construction methods (empirical correlates between scale scores and other independent measures). In fact, the PIC is sometimes characterized as the childhood equivalent of the MMPI.

The PIC is an objective personality-assessment inventory developed for use with children and adolescents, aged 3 to 16, to provide descriptions of "child behavior, affect, and cognitive status, as well as family characteristics" (Wirt, Lachar, Klinedinst, et al., 1984; 1990).

The test was originally published in 1958, and norms were then collected between 1958 and 1962. In 1977, the first manual was published, and in 1979, an interpretive guide appeared. In 1984, two manuals were published, and the PIC was "revised" in 1990; the revised PIC consists of the same items, scales, norms, etc., as its predecessor, but what was changed is the order of the items (Knoff, 1989; Wirt, Lachar, Klinedinst, et al., 1990). In its current format, the PIC consists of 600 true/false items completed by an adult informant, typically the child's mother.

The PIC scales were developed either through empirical methods or through rational methods, where expert judges rated specific items as to scale membership. It is a test that is easy to administer and score, but interpretation of the results is quite complex.

One of the interesting aspects of the PIC, from a psychometric point of view, is the sequence of items. The first 131 items include a lie scale and four broad-band factor scales: undisciplined-poor self-control, social incompetence, internalization-somatic symptoms, and cognitive development. The first 280 items include the above scales, a shortened version of two other validity scales, a general screening scale of adjustment, and 12 clinical scales, such as achievement, depression, delinquency, anxiety, and social skills. The first 420 items include the four broad-band factor scales, and the full versions of all the validity, screening, and clinical scales. Finally, the entire 600 items contain all of the above scales plus some 17 supplemental scales. Thus, the examiner in administering the test has four choices of increasing length: administer items 1 to 131, 1 to 280, 1 to 420, or all 600.

Reliability. Three test-retest reliability studies are reported in the manual, covering a psychiatric sample and two normal samples, with retest intervals from 2 weeks to about 7 weeks. The mean correlation coefficients are in the low .70s to the high .80s, with some individual scales showing coefficients in the low .40s. One internal consistency study is reported on a large clinic sample of more than 1,200 children, with a mean alpha coefficient of .74. Mother-father interrater reliabilities are also reported; typical coefficients tend to cluster in the high .50s to mid .60s range.

Validity. A substantial number of studies are presented, both in the test manuals and in the literature, that support the validity of the PIC, particularly the concurrent, convergent, and discriminant validity. As Knoff (1989) stated, these studies provide an excellent basis for the PIC, but additional work is needed, especially a more
sophisticated and broader-based restandardization (i.e., new norms).

Norms. The PIC was normed between 1958 and 1962 on a sample of 2,390 children from the greater Minneapolis area, with approximately 100 boys and 100 girls at each of 11 age levels, between the ages of 5½ and 16½ years. Additional norms were obtained for a small group of children aged 3 to 5. For scoring purposes, the norms are actually separated into two age levels – 3 to 5, and 6 to 16.

The norms are based on the responses of mothers whose children were being evaluated. Thus the completion of the PIC by any other person, such as the father, may or may not be meaningful in terms of the available norms, particularly because the interrater reliability between mothers and fathers is not that high. Knoff (1989) indicates that the original norms are now "unacceptable."

Criticisms. Despite its relative popularity, the PIC has been criticized on a number of grounds. The items are responded to on a true-false basis, rather than a Likert-type scale, which would give greater variability and perhaps sensitivity to individual differences. The directions do not specify a time frame; the mother is asked to rate her child's behavior, but there is no indication as to the past week, past year, since the beginning of the school year, and so on. It is quite possible that different mothers may implicitly use different time frames. Some of the items involve "guessing" (e.g., "My child is a leader in groups"). These may be minor issues, but together they can seriously limit the utility of an instrument.

The testing environment. Recall that we used the analogy of an experiment as a way of thinking about tests. This is particularly true in testing special children, and we want the testing environment to be free of extraneous conditions. Ideally, the testing room should be well lit, free from noise, distractions, and interruptions. The furniture should be appropriate for the child. A number of authors (e.g., Blau, 1991) have described what the ideal testing environment should be like, but the reality is that testing often takes place in whatever room is available in the school or clinic, with interruptions, scheduling conflicts, and other negative aspects par for the course.

Demands on the examiner. Perhaps more than with any other type of client, the testing of special children places a great burden on the examiner and requires a great deal of patience, adaptability, and understanding. The examiner needs to be highly sensitive to the needs and state of the child. If the child becomes quite fidgety, for example, it may well signal that a trip to the bathroom or a change in activity is in order, but the skilled examiner should "sense" this before the change is actually needed (for an excellent discussion see Chapter 4 in Kamphaus, 1993).

Infants of 7 to 10 months of age typically show anxiety about strangers, and the examiner needs to be particularly sensitive to this. In general, the examiner needs to create a relaxed and informal atmosphere to make sure that the child is comfortable and not overly anxious, yet also establish a structured situation in which the necessary tasks get accomplished. The examiner should be friendly but calm; too much enthusiasm can scare or overwhelm a child. For many children, the closest out-of-home experience they have had is a visit to a physician's office, which may have been difficult and perhaps even traumatic. A visit to a psychologist's office may resurrect anxieties and fears. Children typically do not understand what a psychologist is, or why they are being tested. They may or may not have been prepared prior to the testing; they may perceive the testing as a "punishment" for their misbehavior or a reflection of parental displeasure.

The examiner needs to be thoroughly familiar with the test materials. Tests such as the Stanford-Binet and the WISC-R involve a "kit," a briefcase filled with blocks, beads, picture plates, and other materials. These need to be out of the child's line of sight so they are not a distraction, but readily available to the examiner. The examiner should attempt to present the various test tasks in a standard manner, but should be flexible enough to deviate when necessary, and compulsive enough to carefully document such deviations. The entire test procedure should "flow" well. (For a useful guide to interviewing children see J. Rich, 1968.)

DuBose (1981) points out that from a developmental testing perspective, most theoretical work, as exemplified by Jean Piaget, and most applied
work, as seen in the work of the pediatrician Arnold Gesell, have focused on normal children rather than those who are developmentally impaired. In the training of psychologists, both in developmental psychology and in psychological testing, the emphasis is on normal development; they are poorly prepared to assess special children. In fact, a survey of clinical psychology graduate programs indicated that training in child diagnostic assessment focused largely on intelligence and personality using a narrow range of instruments; the author concluded that training at the predoctoral level is insufficient in preparing professionals to assess the competence and special needs of a broad spectrum of exceptional children (Elbert, 1984).

Characteristics of the child. Quite often, children who are referred for testing present a particular challenge in that they may have poor self-concept, difficulties in maintaining attention, lack of motivation, substantial shyness, poor social skills, and so on. These deficits, of course, affect not only the child's coping with the demands of school, but may well affect their test performance. Thus, establishing good rapport is particularly important.

Children are highly variable. Not only do they differ substantially from each other, but they differ within themselves from day to day and from behavior to behavior. Their emotional states are highly volatile – a child may be very happy at 10 in the morning and very unhappy an hour later. Children's behaviors are often specific to a particular situation; a child may have great difficulties with school peers but may get along well with siblings at home. A particular child may be advanced motorically but may lag in social skills; another child may be more easily distracted, and a third may fatigue easily in the testing situation. In fact, it is generally recommended that children be tested in the morning when they are well rested and alert.

Young children have a short attention span so that sustained effort, such as that required by a test, can be difficult. Children's responses to a test can sometimes be more reflective of their anxiety or attempts to cope with a task they don't understand; thus a child may reply "yes" regardless of the questions asked. Young children also may have limited (or no) reading and writing skills, so that the usual testing procedures used with older children and adults may not be applicable. They do not have the personal history and perspective to make social comparisons (e.g., "In comparison to others, I am more intelligent"). Their observational abilities are limited and they may not have the maturity to reflect on their own behavior or the family interactions that occur. The behavior of young children is also more directly affected by the immediate circumstances. Hunger, fear, boredom, fatigue, etc., can all disrupt a young child's behavior.

On the plus side, most children are curious and self-directed; they are eager to explore and discover, and may well find the testing situation interesting and fun.

Test aspects. There are many recommendations concerning the tests to be administered. They should be selected so that the information obtained answers the referral question. Often agencies and practitioners use the same standard battery for every child, simply because "that's the way it's been done." The measures selected should have a range of difficulty levels and should assess cognitive, affective, and psychomotor aspects. Their selection should reflect the child's age, the ability to use expressive and receptive language, the ethnic-linguistic background, and the examiner's familiarity with specific instruments.

Blau (1991) suggests that a complete psychological test battery should include tests that cover these four areas: (1) the child's intellectual capacities and learning styles; (2) neuropsychological development and status; (3) the child's achievement levels in major school activities; and (4) personality and character.

Sources of information. In testing young children, three sources of information can be used. One involves administering a standardized test to the child directly; clearly a major concern here would be the reliability and credibility of young children's reports (Bruck, Ceci, & Hembrooke, 1998). Second, we can give the parents a checklist or other instrument to fill out, to systematize their knowledge of their child. Third, we can use various procedures to observe the child directly, specifically behavior checklists filled out by the examiner or a teacher.
CATEGORIES OF SPECIAL CHILDREN

We can probably identify some nine major categories of special children, although these categories are not mutually exclusive and, for some specific purposes, we may wish to use different divisions.

Mental Retardation

Mental retardation is typically defined as below-average general intellectual functioning with deficits in adaptive behavior, and which manifests itself during the developmental period. A 28-year-old who sustains massive brain damage in a motorcycle accident and can no longer function intellectually at his preaccident level would not be labeled as mentally retarded. In terms of psychological testing, there are two major areas of concern: intellectual functioning and adaptive behavior. We have already considered intellectual functioning in Chapter 5. We focus here on adaptive behavior.

Adaptive behavior. The term adaptive behavior was introduced and defined in 1959 by the American Association on Mental Deficiency (retardation); it refers to the effectiveness of an individual in coping with environmental demands (Nihira, Foster, Shellhass, et al., 1974). This term was introduced in part because of the dissatisfaction with diagnosing mental retardation solely on the basis of intelligence test results. Thus, mental retardation is currently diagnosed based on both subaverage general intellectual functioning and deficits in adaptive behavior. There are, however, those who believe that adaptive behavior is an elusive concept, difficult to define and hence to measure, and that retardation should be defined only in cognitive terms (e.g., Zigler, Balla, & Hodapp, 1984).

There are a number of scales designed to measure adaptive behavior that are particularly applicable to mentally retarded children. Among these scales may be mentioned the Adaptive Behavior Inventory for Children (J. Mercer & Lewis, 1977), the AAMD Adaptive Behavior Scale (Nihira, Foster, Shellhass, et al., 1974), the Children's Adaptive Behavior Scale (Richmond & Kicklighter, 1980), and the Vineland Adaptive Behavior Scale, formerly known as the Vineland Social Maturity Scale (Sparrow, Balla, & Cicchetti, 1984). For an overview of the measurement of adaptive behavior, see Meyers, Nihira, and Zetlin (1979). We take a closer look at the Vineland as an example of an "old" instrument that has been revised.

The Vineland Adaptive Behavior Scale. The Vineland is probably the best known measure of social competence or adaptive behavior. It assesses self-help skills, self-direction, and responsibility in individuals from birth to maturity. The Vineland Social Maturity Scale was first published in 1935 and subsequently revised in 1947, 1953, and 1965. A technical manual was initially published in 1953 (Doll, 1953). Finally, in 1984 it was restandardized, substantially changed, and rebaptized as the Vineland Adaptive Behavior Scales (Sparrow, Balla, & Cicchetti, 1984).

Development. The Vineland was developed at the Training School at Vineland, New Jersey, a well-known institution for the retarded, still in existence. When the scale was introduced, its author suggested five ways in which the scale could be useful: (1) as a standard measure of normal development to be used repeatedly for the measurement of growth or change; (2) as a measure of individual differences, specifically extreme deviation that may be significant in the study of mental retardation, juvenile delinquency, and related areas; (3) as a qualitative index of variation in development in "abnormal" subjects such as maladjusted individuals; (4) as a measure of improvement following special treatment; and (5) as a schedule for reviewing developmental histories in the clinical study of retardation. Although the Vineland is not an intelligence test, it can be used to obtain developmental data when a child is unable or unwilling to be tested directly. Care must be exercised, however, in going from adaptive behavior to conclusions about cognitive functioning.

Description. The old scale consisted of 117 items arranged serially in increasing order of difficulty. The placement of each item is based on the average age when the item was achieved by the standardization sample; this average age is called the life-age. Table 9.1 illustrates the nine areas of the Vineland, with illustrative items and their
corresponding life-age. Each of the items is more fully defined in the test manual.

Table 9–1. The Vineland Adaptive Behavior Scales

Area                  Examples of items                      Life age
Self-help general     Asks to go to toilet                   1.98
                      Tells time to quarter hour             7.28
Self-help eating      Drinks from cup or glass unassisted    1.40
                      Uses table knife for spreading         6.03
Self-help dressing    Dries own hands                        2.60
                      Exercises complete care of dress       12.38
Self-direction        Is trusted with money                  5.83
                      Buys own clothing accessories          13.00
Occupation            Uses skates, sled, wagon               5.13
                      Performs responsible routine chores    14.65
Communication         Uses names of familiar objects         1.70
                      Communicates by letter                 14.95
Locomotion            Moves about on floor                   .63
                      Walks downstairs one step per tread    3.23
Socialization         Plays with other children              1.50
                      Plays difficult games                  12.30

Revised Vineland. There are three forms of the revised Vineland: a survey form, an expanded form, and a classroom edition. Each form assesses adaptive behavior in four areas: communication, daily living skills, socialization, and motor skills. The first two forms also have an optional scale of maladaptive behavior and are suitable for clients from birth to 19 years, including low-functioning adults. The classroom edition is used for children aged 3 to 13. Each of the four areas is further divided into subareas. For example, the communication area consists of receptive, expressive, and written communication.

Within each of the areas, the items are listed in order according to developmental level, and suggested starting points for each age are indicated. Thus, as we have seen with other tests, there is a basal and a ceiling level, so that every item need not be completed.

The survey form includes 261 items to assess adaptive behavior and 36 items to assess maladaptive behavior. This form covers ages 0 to 19, and is available in Spanish as well.

The expanded form is a longer version of the survey form, and contains 577 items, including the 297 items of the survey form. Basically, this form offers a more comprehensive assessment of adaptive behavior, but the scales and subscales are the same, except that items are grouped together not just within scales, but also within subclusters.

The classroom edition is designed for administration to the teachers and by the teacher – that is, it is a questionnaire completed by a teacher. The classroom edition contains 244 items, some identical to those of the other forms. The scales and subclusters are the same as on the other two forms.

Administration. The scale is to be completed by the examiner based on information obtained from an informant who knows the subject well. The manual, however, cautions that this is not a rating scale and scores are not to be based on mere opinions. Although the items are printed in order on the scoring sheet, the intent is not to follow their order precisely, but to adapt the order to the circumstances. This is akin to an interview, where the information supplied by the informant is "translated" into the test items. Thus, despite the apparent simplicity of the Vineland, the examiner needs to be well qualified – in fact, the manual equates the skills needed here as equivalent to the skills needed for administering the Stanford-Binet. The examiner asks broad questions of the informant (e.g., "How is Amy doing at school?") and where necessary, can follow up with more specific details. Administration time is about 20 minutes for the classroom edition, 20 to 60 minutes for the survey form, and 60 to 90 minutes for the longer expanded form.

Scoring. Scoring is a bit complicated in that items can be scored as habitually performed, formerly performed, no opportunity to perform, occasionally performed, and not performed. Scores at the two extremes are counted 1 and 0 respectively, but items scored in the middle categories may be given full, partial, or no credit. The revised edition uses a 2, 1, 0 scale, but retains the complexity of scoring. The total score, which is the sum of the items, is converted to an age score (social age) by interpolation using the age placement of the items. A ratio social quotient can be calculated by dividing the child's social age
by the chronological age and multiplying by 100. One can also compute a deviation social quotient (Silverstein, 1971). Scores on the Vineland increase up to age 25.
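For example, with hypothetical figures, a child with a chronological age of 10 years whose total score interpolates to a social age of 8.5 years would obtain a ratio social quotient of 8.5/10 × 100 = 85; the same social age at a chronological age of 8 years would yield 8.5/8 × 100, or approximately 106.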
Scoring also involves a basal and a ceiling score, hearing-impaired children. One would expect
defined as two consecutive passes or failures, the two versions of the “same” scale to corre-
respectively. In the revised survey form edition, late substantially higher, although the authors
ceiling is defined as seven consecutive failures. argue that the revised scale is in fact significantly
In the revised Vineland, raw scores are changed different.
to standard scores (mean of 100, SD = 15) on Similarly, scores on the survey form were cor-
each of the four domains, and the four domains related with the parallel scores on the classroom
are averaged to obtain an Adaptive Behavior edition. This yielded correlation coefficients of
Composite, also with a mean of 100 and SD = 15. .31 to .54; lower correlations were obtained on a
Scores can also be changed to various other sample of preschool children in a Head Start pro-
derived scores such as z scores, percentile rank- gram, and higher correlations were obtained in
ings, stanines, and so on. a sample of mentally retarded students. Because
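To make these two kinds of scores concrete, here is a minimal sketch in Python. It is our illustration, not part of the Vineland materials: the norm values are hypothetical, and the deviation conversion shown is a generic one rather than Silverstein's exact procedure.

    def ratio_social_quotient(social_age_months, chronological_age_months):
        # Ratio quotient: social age divided by chronological age, times 100.
        return 100.0 * social_age_months / chronological_age_months

    def deviation_standard_score(raw_score, norm_mean, norm_sd):
        # Generic deviation-type conversion to a scale with mean 100 and SD 15.
        z = (raw_score - norm_mean) / norm_sd
        return 100 + 15 * z

    # A child with a social age of 96 months and a chronological age of 120 months:
    print(ratio_social_quotient(96, 120))           # 80.0
    # A raw domain score of 110 in a (hypothetical) norm group with mean 95, SD 20:
    print(deviation_standard_score(110, 95, 20))    # 111.25

A child whose social age lags behind his or her chronological age thus receives a quotient below 100, just as with the old ratio IQ.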
Reliability. Split-half reliability coefficients for the survey form range from .89 to .98 for the Adaptive Behavior Composite, and from .77 to .88 for the Maladaptive Behavior area. For the four adaptive behavior domains, median split-half reliability coefficients range from .83 to .90. Interrater reliability, based on a sample of 160 clients interviewed by two different interviewers with an interval of 1 to 14 days between interviews, yielded coefficients in the .70s, except for the Socialization area, which yielded a .62 coefficient.

The test-retest reliability of the Vineland seems satisfactory. Test-retest coefficients with a 2- to 4-week interval are in the .80s and .90s. However, teachers and mothers differ in their ratings, with mothers typically reporting information that results in higher social quotients (e.g., Kaplan & Alatishe, 1976).

These coefficients suggest adequate reliability, but some concern remains about the interview method that is at the basis of this scale, i.e., higher interrater reliability would probably be obtained if in fact the individual items were administered individually.
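Split-half coefficients like those above rest on a standard psychometric correction: the correlation between two half-tests underestimates full-length reliability, and the Spearman-Brown formula adjusts for this. A one-line sketch of the general formula (not a computation taken from the Vineland manual):

    def spearman_brown(half_test_r):
        # Estimated full-length reliability from the half-test correlation.
        return 2 * half_test_r / (1 + half_test_r)

    print(round(spearman_brown(0.80), 3))   # 0.889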
Validity. A careful analysis of content validity was done on an initial pool of 529 items to make sure that the items that were retained were of appropriate difficulty level, had adequate discriminative ability, and were in correct developmental sequence.

Criterion-related validity was assessed, in part, by comparing total scores on the revised edition with those on the older edition, i.e., the Adaptive Behavior Composite vs. the Deviation Social Quotient. The two correlated only .55; the correlations were higher in samples of mentally retarded adults and hearing-impaired children. One would expect the two versions of the “same” scale to correlate substantially higher, although the authors argue that the revised scale is in fact significantly different.

Similarly, scores on the survey form were correlated with the parallel scores on the classroom edition. This yielded correlation coefficients of .31 to .54; lower correlations were obtained on a sample of preschool children in a Head Start program, and higher correlations were obtained in a sample of mentally retarded students. Because the classroom edition contains some of the same items as the survey form, we would expect typical correlations to be higher, if for no other reason than item overlap. The relatively low correlations suggest various possibilities: that the two groups of informants, parents and teachers, perceive the same child quite differently; that adaptive behavior is not highly stable across situations, and what is observed at home is in fact different from what is observed at school; that adaptive behavior is a meaningful variable for mentally retarded children but not for normal children; or that the two scales use different methodologies (interview vs. questionnaire) that create different results.

Several studies have compared scores on the survey form with other measures of adaptive functioning. Most of the obtained coefficients are in the mid range, from about .20 to .70, with lower and higher coefficients obtained depending upon the nature of the sample, the test form used, and other aspects.

Correlations with standard measures of cognitive functioning are also relatively low, typically in the .20 to .50 range. This, of course, supports the idea that the Vineland is measuring something different from, yet related to, cognitive functioning.

Construct validity is supported in a number of ways. For example, the results of factor analyses suggest one major factor at various ages and support the use of the Adaptive Behavior Composite as an overall measure of adaptive skills. Support is also found for the separate use of the various areas. It is interesting to note that gender differences in the original items were reported to be so small as to be negligible.
In general, the survey form seems to be relatively valid. The expanded form takes most of its validity by “association” – that is, because the survey form is valid, the expanded form must also be valid. The classroom edition seems to be the questionable one. (For a review, see Bailey-Richardson, 1988.)

The manual provides considerable information on the interpretation of scores, as well as on the use of the Vineland in conjunction with the K-ABC. This is a plus because the two tests then represent a “package” for which scores can be directly compared.

Norms. The original Vineland was standardized on 620 white male and female residents of New Jersey, including 10 males and 10 females at each year of age from birth to 30 years. The norms for the revised Vineland are based on a national sample of 3,000 individuals, with about 100 persons in each of 30 age groups between birth and 19 years. The sample was stratified on a number of dimensions according to the 1980 U.S. Census data. Many of the children in this sample also participated in the standardization of the K-ABC. These children were basically “normal.” In addition, seven samples of handicapped children were also assessed. Each sample contained from 100 to significantly more than 1,000 individuals. Norms are thus given for several special populations, such as mentally retarded adults in residential facilities, visually handicapped children, and hearing-impaired children. Norms are provided at 1-month age intervals for infants between birth and 2 years, at 2-month age intervals for children aged 2 through 5, at 3-month intervals for children aged 6 through 8, and at 4-month intervals for children aged 9 through 18.

Norms for the classroom edition were developed using a rather different method, which we need not discuss here (the interested reader should consult the test manual by Sparrow, Balla, & Cicchetti, 1984).

Use with handicapped children. The Vineland makes provisions for specialized interpretation of handicapped children, and separate scales have been developed for visually-impaired, hearing-impaired, and emotionally disturbed children in residential settings.

Behavioral-Emotional Disorders

This is a broad, ill-defined area covering children whose behavior deviates from what is expected or typical. These may be problems of conduct, that is, children who behave in ways that are unacceptable to a particular culture (for example, juvenile delinquents, truants), or the category may involve children who are severely emotionally disturbed. Such problems may exist within the context of other conditions, such as mental retardation, or may exist by themselves. One category of children sometimes listed under this label is autistic children. Autistic children have often been considered “untestable,” but a number of authors have in fact shown that testing is not only possible but quite useful in planning appropriate treatment (e.g., A. F. Baker, 1983; B. J. Freeman, 1976; Rutter, 1973). Traditional intelligence tests such as the Wechsler scales, the McCarthy scales, and the Bayley have been used with autistic children, as well as specialized instruments designed specifically for this population (e.g., Flaharty, 1976; Schopler & Reichler, 1979). In addition, some excellent resources are available in the literature on the assessment of autistic children (e.g., Baker, 1983; Wing, 1966).

Learning Disabilities

Perhaps more than any of the other categories, this is one that is filled with controversy as to definition and scope. Usually, these disabilities involve difficulties in the understanding and use of spoken or written language, not associated with the other categories (such as mental retardation), but presumably related to dysfunctions of the central nervous system.

Children with hyperactivity (formally known as Attention Deficit Hyperactivity Disorder) might be listed under this label. Some authors believe that traditional standardized testing with hyperactive children is of limited value, and that what is needed is a more multimodal, intervention-oriented strategy (e.g., DuPaul, 1992).
A number of scales have been developed to assess this condition, including the Conners scales discussed below, the McCarney Attention Deficit Disorders Evaluation Scale (McCarney, 1989), and the Attention Deficit Disorder-Hyperactivity Comprehensive Teacher's Rating Scale (Ullmann, Sleator, & Sprague, 1991).

Motor Impairments

These are neuromuscular conditions that affect such aspects as fine motor control (for example, holding a pencil to write), mobility, and posture. Some of the more common conditions include cerebral palsy, a central nervous system lesion that leads to motor dysfunction of the limbs; spina bifida, an incomplete closure of the spinal cord that results in a lesion and in some degree of neurologic impairment; and muscular dystrophy, which represents a wide range of conditions involving muscle degeneration and weakness. The impairment may be present as part of multiple handicaps, or it may exist by itself.

Cerebral palsy involves “encephalopathic conditions culminating in muscular incoordination and may further include convulsive states, intellectual deficits, impairment in ability to think, and specialized hearing deficits” (Allen & Jefferson, 1962). Although cerebral palsy has a low incidence, with only 1 or 2 per 1,000 children having this condition, approximately 25,000 children are born each year with it. In addition, one half to three fourths of these children have speech disorders, one half have visual impairments, one third hearing defects, and one third convulsive disorders; about two thirds suffer from severe emotional stress, and a full 50% are mentally retarded (Barnett, 1982).

Children with cerebral palsy, for example, often show a marked degree of incoordination and stiffness, and they often perform motor tasks in a slow and laborious manner. For example, they may not be able to manipulate test materials as is required in several of the Wechsler subtests, or may not be able to respond within the stated time limits. Katz (1955) pointed out, for example, that the standard administration of the Stanford-Binet to children with cerebral palsy underestimated the child's intellectual abilities “in proportion to the child's severity of handicap.” Thus care must be exercised in selecting a testing instrument that does not confound receptive with expressive skills, for example, or motoric with verbal responses. Studies of the intellectual functioning of cerebral palsied children indicate that about 50% score below an IQ of 70 – that is, they are also mentally retarded. Quite often these children also have visual-perceptual or visual-motor problems that compound the challenge of testing.

In assessing a child with motor impairment, the examiner should be aware of any medications or other medical conditions, such as recent surgery, that might affect test performance. Because intelligence is so often assessed through language, the examiner needs to be particularly careful not to come to conclusions that are either incorrect or not appropriate to the available data. The examiner needs to be sure that the testing situation is as objective as possible. For example, if a parent needs to be present during testing, the examiner needs to make sure that there is no visual contact between the child and parent that might provide extraneous cues or expectations. The examiner needs to be particularly sensitive to the needs of the child. For example, handicapped children may be more susceptible to fatigue in a standardized testing situation. Children with motor impairments may need special positioning to show optimal performance (Stephens & Lattimore, 1983).

These children may find it difficult to respond to tests that are timed or that require the manipulation of objects, such as some of the subtests of the Wechsler. Alternative tests such as the Pictorial Test of Intelligence (J. L. French, 1964), the Columbia Mental Maturity Scale (Burgemeister, Blum, & Lorge, 1972), or the Peabody Picture Vocabulary Test (L. M. Dunn & L. Dunn, 1981), discussed next, are recommended. There are various measures of gross motor functioning mentioned in the literature. Among these might be listed the Bruininks-Oseretsky Test of Motor Proficiency (Bruininks, 1978), the McClenaghan and Gallahue Checklist (1978), and the Vulpe Assessment Battery (Vulpe, 1982).

When tests are modified, it is difficult to know whether such modifications make the available norms inapplicable. Modifications can involve using eye movements instead of pointing, presenting items in a multiple-choice format, steadying the child's hand, and other procedures.
Some authors (e.g., Sattler & Tozier, 1970) argue that new norms using the modified procedure need to be established; they reviewed test modifications used with various handicapped groups and found only a handful of studies that assessed such modifications. Most of these studies were judged inadequate, but the findings were of nonsignificant differences between standard and modified administrations. Regular standardized instruments such as the Stanford-Binet can be used with some children with motor impairments. In fact, some testers prefer the Stanford-Binet over the Wechsler because of its modifiability and its higher proportion of verbal over perceptual-performance items. On the other hand, you recall that the WISC-R yields separate scores for verbal and performance scales; this allows an assessment of difficulties with such aspects as information processing and attention, often found in the cerebral palsied individual. Also, the fact that the WISC-R yields a profile of scores based on subtests allows more directly for an analysis of weaknesses and strengths (Simeonsson, Bailey, Huntington, et al., 1986).

Needless to say, when any child, “special” or not, is to be tested for clinical purposes, the examiner should have as much information as possible about the child's background, academic history, medical status, and so on. Testing is not to be treated as a parlor game where the test is administered in a blind fashion to see whether the results correspond with reality.

The Peabody Picture Vocabulary Test-Revised (PPVT-R). The PPVT-R (L. M. Dunn & L. Dunn, 1981) is one of the better known measures to assess receptive vocabulary. It was originally published in 1959, presented as a screening measure of general intellectual ability, and revised in 1981. It is nonverbal in nature and yields mental age and IQ indices. The PPVT is composed of two equivalent forms, L and M, with 175 items per form. Each item consists of a page (or plate) with four line drawings. The examiner says a word and the child points to the drawing that “matches” the word. The PPVT-R covers the age range of 2 to 18, with basal and ceiling rules to determine which items are administered to the client. It is not a timed test, it takes approximately 10 to 20 minutes to administer, and the availability of the two alternate forms is a plus. The test can be used for normal children, but it is particularly applicable to special children for whom standard tests such as the Stanford-Binet may not be suitable. The score is simply the number of correct responses. Raw scores are converted to standard scores, with mean of 100 and SD = 15.
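Basal and ceiling rules of this kind are, at bottom, a simple algorithm: credit everything below the basal, test upward, and stop once a run of failures (the ceiling) is reached. The Python sketch below illustrates the general logic only; the window size and failure count are hypothetical, not the published PPVT-R criteria, which are given in the manual.

    def raw_score(responses, basal_index, window=8, misses_to_stop=6):
        # responses: list of True/False for every item, in difficulty order.
        # Items below the basal are credited without being administered.
        score = basal_index
        recent = []
        for passed in responses[basal_index:]:
            score += int(passed)
            recent = (recent + [passed])[-window:]
            # Stop testing once the examinee misses `misses_to_stop`
            # of the last `window` items administered (the "ceiling").
            if len(recent) == window and recent.count(False) >= misses_to_stop:
                break
        return score

    # Ten items credited below the basal, then T F T F F T F F F F:
    print(raw_score([True] * 10 +
                    [True, False, True, False, False,
                     True, False, False, False, False],
                    basal_index=10))   # 13 (10 credited + 3 passed)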
Development. The items were originally selected by the author to represent unbiased common words used in the United States. Of the 300 words contained in the two forms, only 111 (or 37%) were retained for the revised edition. Words that had any type of bias – racial, geographical, cultural, or regional – were eliminated. Subsequent research has shown that these efforts to eliminate potentially biased items were quite successful, particularly with form M (Reynolds, Willson, & Chatman, 1984).

Reliability. For form L, internal consistency rs range from .67 to .88, and for form M they range from .74 to .86. Similar coefficients are reported for test-retest reliability, in which one form was followed by a different form. Alternate form reliability on a sample of 642 subjects yielded rs ranging from .71 to .89. In general, then, the reliability indices are somewhat lower than ideal. Related to this, the SE of measurement is about 7 points, twice the size of that found on standard tests of intelligence, such as the WISC. Bochner (1978) reviewed 32 reliability studies on the PPVT, most done on Head Start children. She reported a median reliability coefficient of .72 and concluded that for average children in the elementary grades and for retarded individuals of all ages, the PPVT showed acceptable equivalence of forms (i.e., alternate forms reliability) and stability (test-retest).
Validity. The test manual primarily addresses content validity, indicating that the test items were carefully chosen according to various criteria. Bracken, Prasse, and McCallum (1984) published a comprehensive review of the PPVT-R and indicated correlations in the .70s and low .80s between scores on the PPVT-R and other tests of intelligence, although in some individual studies the results were not as positive. The results seem to be variable, as a function of the instruments used, the nature of the samples, and other aspects. They noted that scores on the PPVT-R tended to be lower than those obtained on the Stanford-Binet or the WISC-R. Thus, the concurrent validity with standard tests of intelligence is quite good, and sometimes the PPVT has been used as a substitute for a test of general intelligence. However, the evidence indicates that the PPVT-R, although useful as a screening instrument, should not be substituted for a more comprehensive measure of cognitive functioning. Some evidence suggests that PPVT IQs are lower than those obtained on the Stanford-Binet in the case of minority children, but that they are higher for children who come from well-educated and verbally articulate families.
An example of a concurrent validity study is that by Argulewicz, Bingenheimer, and Anderson (1983), who studied a sample of Anglo-American and Mexican-American children in first through fourth grade. They found the Mexican-American children to score almost a standard deviation below the Anglo-American children on both forms of the PPVT-R. Only Form L correlated significantly with achievement measures of reading and mathematics in both groups (.31 and .41 with reading, and .29 and .36 with mathematics, for Mexican-American and Anglo-American children, respectively), with group differences not statistically significant.

The PPVT-R has also been used with adults, both mentally retarded (e.g., Prout & Schwartz, 1984) and normal (e.g., Altepeter & Johnson, 1989). The results are rather mixed and inconsistent, but the conclusion is the same – caution should be used when the PPVT-R is used with adults.

Norms. The original normative sample consisted of some 4,000 white persons residing in or around Nashville, Tennessee. Norms for the revised edition, however, are quite extensive, based on stratified samples of 4,200 children and adolescents and 828 adults, selected according to U.S. Census data. The adults were tested in groups using slides of the test plates.

Adaptations. Adaptations of the PPVT have been made. Although the PPVT only requires a pointing response, which is part of the response repertoire of most children over the age of 1, there are some children, such as autistic children, who do not exhibit such a response. Levy (1982) administered both the standard form and a cut-up version to 10 normal 4- to 6-year-olds, as well as to 10 autistic 5- to 7-year-old children. In the cut-up version, the instructions were to “give me the — ,” using the stimulus word. The performance of the normal children remained essentially the same on the two forms, but the autistic children showed significant improvement on the modified form.

The PPVT-R and decision theory. If the PPVT-R is used as a screening instrument, how correct are the decisions that are made? Part of the answer is supplied by F. B. Hayes and Martin (1986), who studied the effectiveness of the PPVT-R in identifying young gifted children. They tested 100 children aged 2 to 6 who had been referred to a university's preschool assessment project. Referral meant that someone, typically the parent, thought that the child had exhibited early intellectual and/or language development and could possibly participate in programs for the intellectually gifted.

Children were administered both the PPVT-R and the Stanford-Binet. One of the analyses presented by the authors assumes that the Stanford-Binet IQ is in fact the criterion to be used for possible identification as gifted. Therefore, we can ask: if we used a score of 130 as the cutoff score on the PPVT-R, how many children would we correctly identify? In fact, the hit rate was 69%, which included 64 children “correctly” identified as having Stanford-Binet IQs lower than 130, and 5 children with IQs higher than 130. Unfortunately, there were also 31 errors, including a 30% false negative rate of children who scored lower than 130 on the PPVT-R but higher than 130 on the Stanford-Binet, and 1 false positive. Various other cutoff scores did not produce highly different results. The authors concluded that the use of the PPVT-R to identify young gifted children is “questionable.”
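These figures translate directly into the decision-theory indices discussed in Chapter 3; the sketch below simply recomputes them from the counts reported in the study.

    # Counts from the Hayes and Martin (1986) analysis described above,
    # treating Stanford-Binet IQ >= 130 as the criterion for "gifted."
    true_pos = 5     # PPVT-R >= 130 and Stanford-Binet >= 130
    false_pos = 1    # PPVT-R >= 130 but Stanford-Binet < 130
    true_neg = 64    # both below 130
    false_neg = 30   # PPVT-R < 130 but Stanford-Binet >= 130

    n = true_pos + false_pos + true_neg + false_neg    # 100 children
    hit_rate = (true_pos + true_neg) / n               # 0.69
    sensitivity = true_pos / (true_pos + false_neg)    # 5/35, about .14
    specificity = true_neg / (true_neg + false_pos)    # 64/65, about .98

    print(hit_rate, round(sensitivity, 2), round(specificity, 2))

The hit rate of 69% looks respectable until one notices the sensitivity: only about 14% of the truly gifted children (by the Stanford-Binet criterion) were detected, which is what makes the PPVT-R “questionable” for this purpose.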
Criticism. In general, the PPVT-R seems to be a useful measure because it is brief, simple to administer and score, and essentially nonthreatening to the client. The literature is very clear in indicating that its usefulness should be limited to that of a screening instrument.

Speech Impairments

These are a wide variety of conditions and symptoms that interfere with spoken communication. Communication skills are extremely important in normal development, and their assessment is quite basic. Often such skills are categorized into receptive and expressive areas, or in terms of structure of language, contents, and use or context (e.g., Bloom & Lahey, 1978). Receptive language refers to understanding language that is spoken or written by others. This requires reading and listening skills, and the understanding of various communication channels such as nonverbal gestures. Expressive language refers to the skills necessary to express one's ideas; this may be done by speaking, writing, or gesturing. A number of scales have been developed specifically to assess such communication skills; for example, the Receptive and Expressive Emergent Language Scale (Bzoch & League, 1971), the Reynell Developmental Language Scales (Reynell, 1969), and the Clark-Madison Test of Oral Language (Clark & Madison, 1984). Such skills can also be assessed in a more global manner by such tests as the Stanford-Binet. Other useful tests are the Detroit Tests of Learning Aptitude (Baker & Leland, 1959), the Illinois Tests of Psycholinguistic Abilities (S. A. Kirk, McCarthy, & W. D. Kirk, 1968), the Northwestern Syntax Screening Test (L. Lee, 1971), the Carrow Elicited Language Inventory (Carrow, 1973), and the Bracken Basic Concept Scale (Bracken, 1984).
Early identification of language delay in children is very important because such children are likely to exhibit not only linguistic deficits later on, but also academic and social difficulties. The characteristics of a good screening test for language development include short administration time, assessment of various levels of linguistic functioning, and the ability to measure linguistic skills rather than academic development (Cole & Fewell, 1983). A representative test is the Token Test (DeRenzi & Vignolo, 1962), which consists of a number of tokens in different shapes, sizes, and colors. The examiner asks the child to “touch a blue square,” “touch a small one,” and so on.

The Boehm Test of Basic Concepts. A somewhat more restricted test, but one fairly popular in the literature, is the Boehm Test of Basic Concepts. The Boehm was originally published in 1971 and revised in 1986. This test is designed to assess a child's mastery of those basic concepts necessary to understand verbal instruction and to achieve in the early school years. The test covers kindergarten to grade 2 and has three forms: two that are alternate forms, and one that is an “applications” form designed to assess mastery of basic concepts used in combination with other basic concepts. These concepts involve such basic notions as left and right, first and last, more and less, whole vs. part, and so on. There is a version of the test for use with blind children. The Boehm can serve as a screening test to identify children who have deficiencies in those basic concepts and therefore might need special attention from the teacher.

Each of the alternate forms is divided into two booklets, with each booklet having 3 practice items and 25 operational pictorial items arranged in approximate order of increasing difficulty. The child looks at a picture composed of three objects, such as an ice cream cone, a piece of pie, and a shirt, and the teacher asks, “Which of these should a child never eat?” School-aged children can mark their answers with a pencil or crayon, while preschool children can simply point. The two alternate forms use the same concepts but different illustrations.

The manual is quite clear in its instructions, and the black-and-white line drawings are quite unambiguous. The test can actually be administered to a small group of children, with the teacher reading each question, but more typically individual administration is needed. If a child has a short attention span, the test can be divided into sections, with each section administered separately. Administration of this test does not require a high degree of training or expertise.

Scoring. Scoring is straightforward. A single test protocol can be used for all children tested in one class; this permits an analysis of the frequency of errors made by the children, so the teacher can focus on those concepts that were more frequently missed. Scoring is done by hand, one page at a time, and takes some 5 to 10 minutes per booklet; this is a considerable amount of time when one considers scoring a handful of them.
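The frequency-of-errors analysis just described is easy to picture computationally; the following sketch uses an invented data format of ours, not anything from the Boehm materials.

    from collections import Counter

    def concept_error_counts(class_responses):
        # class_responses: {child_name: set of item numbers missed}.
        # Returns how many children missed each item, most-missed first.
        counts = Counter()
        for missed_items in class_responses.values():
            counts.update(missed_items)
        return counts.most_common()

    roster = {"Ana": {3, 7, 12}, "Ben": {7, 12}, "Cara": {12}}
    print(concept_error_counts(roster))   # [(12, 3), (7, 2), (3, 1)]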
Reliability. The Boehm has reasonable split-half and alternate forms reliability at the kindergarten and grade 1 levels (low to mid .80s coefficients), but fares less well at the second grade level – there the split-half coefficients are .64 and .73 for the two forms, and the alternate form reliability is .65. The reason for these lowered coefficients is a ceiling effect, that is, the test becomes so easy for second graders that it does not provide differentiation between children, except for the lowest scoring students. The result is that the distribution of scores has a high negative skew and a small variance.
Validity. Content validity seems to be the strong area for this test. Test items were chosen because of their frequency of use in school and by teachers. Substantial criterion-related validity is presented in the form of correlations with other tests, such as achievement tests; the coefficients range from about .24 to about .64, with a median in the low .40s. The Boehm does seem to be a good predictor of early school success (Estes, Harris, Moers, et al., 1976). It has been criticized as being culturally biased, but Reynolds and Piersel (1983), in fact, found no empirical evidence for such bias in a study of white and Mexican-American children.

Norms. Students from 15 states were tested, but the norms are said to be representative of the national school population, having been selected according to U.S. Census data. The 1983 norms are based on approximately 10,000 children.

Hearing Impairments

Children with hearing impairments can be classified into two broad categories: the hard of hearing and the deaf. The difference essentially lies in whether their hearing level can or cannot be enhanced and used. Many other terms have been used to try to distinguish among various types of hearing impairment. One major distinction is that between the congenitally deaf (i.e., born deaf) and the adventitiously deaf (i.e., those who became deaf later in life because of illness, accident, etc.). Another distinction is whether the person's hearing is nonfunctional for conducting ordinary aspects of everyday life (i.e., deaf) vs. functional with or without a hearing aid (i.e., hard of hearing). Another distinction is whether the impairment occurred before or after the development of language (i.e., prelingually or postlingually deaf).

Hearing impairments are caused by all sorts of etiological factors, ranging from trauma to viral infections such as meningitis, and, in general, they affect language skill acquisition more severely than other skill areas. These special children have very limited, if any, linguistic skills, and testing presents a particular challenge. In a way, English is typically a second language for these children. All the caveats that apply to the testing of minority or culturally different children apply particularly to hearing-impaired children. Hearing impairment can vary from mild to profound. About two out of three children who are enrolled in special education programs have either a profound or severe hearing loss. Approximately three of four of these children had a hearing loss present at birth. One out of two hearing-loss children has additional handicapping conditions, such as cerebral palsy (Meadow, 1983; Scherer, 1983). At the same time, it needs to be pointed out that hearing-impaired children are not a homogeneous group. In one study, for example, significant differences in WISC-R Performance Scale scores were obtained in hearing-impaired children when they were subdivided according to the etiology of their impairment (for example, genetic vs. multiple handicapped; Sullivan, 1982).

Sullivan and Vernon (1979) present an excellent overview of psychological tests and testing procedures with hearing-impaired children. They point out, for example, that most deaf children understand only about 5% of what is being said through lipreading – and that such things as a mustache or beard on the face of the speaker can make lipreading even more difficult, if not impossible.

In testing hearing-impaired children, nonverbal-performance items are essential. Testing should be carried out in a well-defined, distraction-free area. The test materials should be brightly colored, multifunctional, and multisensory. The examiner should be demonstrative in gestures and facial expressions, should provide demonstrations with sample items, and should manually guide the child through the practice items. The examiner is encouraged to use smiles, touch, and claps to reward the child's efforts. Rapport building is particularly important because hearing-impaired children may often be socially withdrawn, shy, and hesitant (Bagnato & Neisworth, 1991).
Often, hearing-impaired children are evaluated by using nonverbal tests that have been standardized on the hearing population – for example, the performance portion of the WISC-R or an adaptation of the WPPSI for deaf children (Ray & Ulissi, 1982). There are a number of difficulties with this approach. If only one scale or subtest is used, the obtained information is quite limited. Second, hearing-impaired and normal-hearing children do not differ solely in their capacity to hear. They have had different experiences with language, which is intimately related to problem solving as well as to a host of other aspects such as social skills. Even though performance tests are “nonverbal,” they still have verbal components, such as the ability to understand instructions or to respond within a time limit.

McQuaid and Alovisetti (1981) surveyed psychological services for hearing-impaired children in a portion of the northeastern United States and found that the Wechsler scales, especially the performance scale of the WISC-R, were commonly used. One of the measures also used, and one developed specifically for the hearing impaired, is the Hiskey-Nebraska Test of Learning Aptitudes, which we discuss next. (For an overview of the assessment of auditory functioning, see Shah & Boyden, 1991.)

Hiskey-Nebraska Tests of Learning Aptitude. This test was first published in 1941 and was originally designed as a measure of learning ability for deaf children. In 1955, it was standardized on hearing children to provide a measure of intelligence for children who might be at a disadvantage on highly verbal tests of ability. The age range covers 2½ to 17½, with five subtests applicable to ages 3 to 10, four subtests applicable to ages 11 to 17, and three subtests that range across all ages. Table 9.2 indicates the 12 subtests that make up the Hiskey.

Table 9–2. Hiskey-Nebraska Tests of Learning Aptitude

    Subtest name              Appropriate age range
    Bead patterns             3 to 10
    Memory for color          3 to 10
    Picture identification    3 to 10
    Picture association       3 to 10
    Paper folding             3 to 10
    Visual attention span     All ages
    Block patterns            All ages
    Completion of drawings    All ages
    Memory for digits         11 to 17
    Puzzle blocks             11 to 17
    Picture analogy           11 to 17
    Spatial reasoning         11 to 17

Description of subtests:
1. Bead patterns: At the younger levels, the child strings beads as rapidly as possible. At the older levels, the child reproduces a string of beads, matches the pattern of round, square, and rectangular beads, and may do so from memory.
2. Memory for color: The child selects from memory one or more color chips to match the chip(s) presented by the examiner.
3. Picture identification: The child is required to select one of several pictures that matches the target picture.
4. Picture association: The child selects a picture that “goes with” a pair of pictures presented by the examiner.
5. Paper folding: The child imitates from memory paper-folding sequences shown by the examiner.
6. Visual attention span: The child reproduces a series of pictures from memory.
7. Block patterns: The child reproduces block construction patterns.
8. Completion of drawings: The child draws the missing parts of geometric forms or pictures of objects.
9. Memory for digits: The child is required to reproduce from memory a series of visually presented numerals, using black plastic numerals.
10. Puzzle blocks: The child puts together puzzle pieces that make a cube.
11. Picture analogy: The child completes a visually presented analogy by selecting the correct answer from five alternatives.
12. Spatial reasoning: The child identifies from four alternatives the geometric figures that could be put together to form the target figure.

Instructions on each subtest may be presented orally or by pantomime, depending on the child's hearing acuity. The items were selected on the basis of several criteria. They needed to reflect school experience, but to be adaptable for use in a nonverbal test and administrable by pantomime. Performance on an item needed to be correlated with acceptable criteria of intelligence, and not be influenced by time limits. In some ways the Hiskey is analogous to the WISC; it is a broad-based measure of intellectual functioning, composed of a number of subtests.
Development. The Hiskey was standardized on more than 1,000 deaf children and 1,000 hearing children, aged 2½ to 17½. The hearing sample was stratified according to parental occupation as an indicator of social class, but other information is lacking. Most likely, the deaf children represent samples of convenience.

Administration. As with most other intelligence tests that cover a developmental span, which tasks are presented to the child is a function of the child's chronological age. Testing is discontinued after a specific number of consecutive failures. Instructions for each task can be presented orally or in pantomime. Three of the subtests (bead patterns, block patterns, and puzzle blocks) have time limits on individual items. Total testing time is between 45 and 60 minutes.

Scoring. The manual provides clear scoring criteria, although they are a bit complicated because each subtest uses somewhat different scoring procedures. For each of the subtests, a mental age is obtained, and a median mental age is then computed as an overall rating of intelligence. The median is used on the subtests as well (rather than a raw score sum) because the author believes that deaf children often tend to score poorly on the initial items of a task because they fail to understand completely what they are to do. The median rating can then be converted to a deviation IQ, with mean of 100 and SD = 16. For deaf children, however, the scoring procedure is slightly different. The median age is converted to a learning quotient using the ratio:

    learning quotient = (learning age / chronological age) × 100

You will recognize this as the old Stanford-Binet procedure, abandoned in favor of the deviation IQ. In fact, the Hiskey seems, in some ways, to be an adaptation of the Stanford-Binet.
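A worked example of the ratio (the ages are hypothetical):

    def learning_quotient(learning_age_months, chronological_age_months):
        # Ratio-style quotient used for deaf children on the Hiskey.
        return 100.0 * learning_age_months / chronological_age_months

    # A median learning age of 10 years 0 months (120 months) at a
    # chronological age of 12 years 6 months (150 months):
    print(learning_quotient(120, 150))   # 80.0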
Reliability. The manual reports split-half coefficients in the .90s for both deaf and hearing children (Hiskey, 1966). Test-retest reliability is reported by B. U. Watson (1983) to be .79 for a sample of 41 hearing-impaired children retested after a 1-year interval.

Validity. The test manual gives minimal information on validity, basically addressing only concurrent validity for hearing children. Correlation coefficients in the high .70s and low to middle .80s are reported with the Stanford-Binet and the WISC.

B. U. Watson and Goldgar (1985) assessed 71 hearing-impaired children and reported a correlation coefficient of .85 between the learning quotient of the Hiskey and the WISC-R Full Scale. Phelps and Branyan (1988) studied 31 hearing-impaired children and found correlations of .57 with the K-ABC nonverbal scale and .66 with the WISC-R Performance score.

Norms. The norms are based on a standardization sample of 1,079 deaf children and 1,074 hearing children, ranging in age from 2½ to 17½. The majority of the deaf children attended schools for the deaf, while the hearing sample was stratified according to parental occupation to match U.S. Census figures. Norms for individuals older than 17 are based on extrapolation and cannot be considered reliable.

Criticisms. Some authors feel that the Hiskey is “psychometrically inadequate and cannot be recommended for use” (e.g., Kamphaus, 1993). Some authors do not recommend this test because the norms are not representative and may be outdated. Some suggest that the nonverbal scale of the K-ABC is a better instrument that can be administered to deaf children with pantomimed instructions (Aylward, 1991). The Hiskey continues to be used, in part, because psychologists working with hearing-impaired children have relatively few choices.

Visual Impairments

Here also two major categories may be distinguished: those who are blind and those with partial vision. If the loss of vision is congenital, that is, present at birth, the child may lag developmentally in several areas, including gross-motor behavior and other visually dependent behaviors such as smiling (e.g., Fraiberg, 1977). Often these conditions are part of a total picture that may involve sensory and emotional deficits, disorders of learning and development, or even include mental retardation and/or other deficits.
Sometimes blind children are misdiagnosed as mentally retarded because standard intelligence tests “discriminate” against visually impaired children. Blindness has a great impact on sensorimotor development and less of an impact on verbal and cognitive skills.

Visual acuity is typically measured with the well-known Snellen chart or scale (National Society to Prevent Blindness, 1974) for clients who can read, or with an adaptation that uses only the letter E with the “arms” pointing in different directions. Other scales are also available, such as the Parsons Visual Acuity Test (Cibis et al., 1985), as are test procedures that assess preferential looking (e.g., Cress, 1987).

Hansen, Young, and Ulrey (1982) suggest that when the professional is testing visually impaired children, he or she should distribute the testing over time rather than doing it at one sitting; this gives the child time to become familiar with the surroundings and with the test items, and it enables the tester to become familiar with what is normal behavior in visually impaired children and to be flexible with the testing procedures.

Many of the tests used with visually handicapped persons have been standardized on normal samples rather than on special children. Where there are norms for visually impaired groups, the samples are typically “atypical” – that is, they may be small, unrepresentative, too homogeneous (children with a particular condition being assessed in a special program), or too heterogeneous (all types of visual impairment lumped together).

Usually, tests need to be modified for use with visually impaired clients. The modifications can vary from using braille to substituting objects for words. From its early days, changes were made to the Binet-Simon for possible use with the visually impaired. For example, Irwin (1914) simply omitted those items that required vision. Hayes (1929) did the same for the Stanford-Binet and eventually developed the Hayes-Binet, which was widely used for assessment of the visually impaired. A more recent version is the Perkins-Binet (Davis, 1980), which provides a form for children who have some usable vision and a form for children who do not.

The Wechsler scales have also been used with the visually impaired, quite often by modifying items or by omitting subtests. Despite the fact that the Wechsler scales seem to be the most frequently used cognitive tests with visually impaired clients (Bauman & Kropf, 1979), relatively little is known about the reliability and validity of these scales with the visually impaired. In general, the reliability seems to be adequate (e.g., Tillman, 1973), and the concurrent validity, as assessed by correlations of the Binet with the Wechsler scales, is substantial (e.g., Hopkins & McGuire, 1966).

Many of the subtests of standardized tests such as the WPPSI, the Stanford-Binet, and the McCarthy Scales assess perceptual-motor development either directly or indirectly. Other tests that can be listed here include the Developmental Test of Visual-Motor Integration (Beery & Buktenica, 1967) and the Grassi Basic Cognition Evaluation (Grassi, 1973). Listings of tests that are appropriate for visually impaired individuals can be found in the literature. For example, Swallow (1981) lists 50 assessment instruments commonly used with the visually impaired.

Children with Chronic Illnesses

There are a number of diseases, such as asthma, diabetes, and epilepsy, that present special challenges, may interfere with the educational experience, and create special problems and the need for identification and assessment to provide ameliorative services.

A child's understanding of illness in general, and of his or her own condition in particular, is an important area of assessment because that understanding is potentially related to how well the child copes with aspects such as hospitalization and treatment (Eiser, 1984; Reissland, 1983; Simeonsson, Buckley, & Monson, 1979).

Gifted

It may seem strange to have this category among a list of categories that primarily reflect deficits or disturbances in functioning, but from an educational point of view this too is a “deviant” group that presumably requires special educational procedures. Giftedness is defined in various ways, but in most practical situations it is operationalized as high scores on an intelligence test. In some instances, evidence of creativity, originality, and/or artistic talent is also sought. Marland (1972) defined gifted children as those capable of high performance in one or more of six areas: general intellectual ability, specific academic aptitude, creative or productive thinking, leadership ability, visual and performing arts, and psychomotor ability. Sternberg and Davidson (1986) used a fourfold classification of superior abilities: intellectual skills, artistic skills, niche-fitting skills (e.g., a mechanically inclined child), and physical skills.
Many states require multiple-selection criteria for the identification of the gifted or talented. Typically, measures of intelligence and of achievement are used, in addition to teacher nominations. These procedures, however, have been criticized by many in that the process seems to be dominated by the assessment of convergent thinking rather than divergent thinking (e.g., Alvino, McDonnel, & Richert, 1981).

Teacher ratings of the giftedness of their pupils have been criticized (e.g., J. J. Gallagher, 1966; Gear, 1976), although the limitations may reflect lack of instruction as to what creativity really is, as well as problems inherent in the rating forms. In an attempt to make teacher ratings more objective and systematic, Renzulli and his colleagues (Renzulli, Hartman, & Callahan, 1971; Renzulli, Smith, White, et al., 1976) developed the Scales for Rating the Behavioral Characteristics of Superior Students. Unfortunately, subsequent research has not supported the validity of these scales (e.g., Gridley & Treloar, 1984; Houtz & Shaning, 1982; Rust & Lose, 1980).

A number of authors have developed checklists of characteristics, traits, or behaviors supposedly related to giftedness; typical items are “this child is curious,” “this child is a rapid reader,” and “he or she learns easily and readily” (Denton & Postwaithe, 1985; Martin, 1986; Tuttle & Becker, 1980). Unfortunately, most of these checklists are no better than the checklists found in popular magazines. They are the fictional product of an author's imagination and biases, devoid of any reliability and validity. The items are often vague or apply to almost any child; quite often they reflect the behavior of children who are highly intelligent but not necessarily creative.

SOME GENERAL ISSUES ABOUT TESTS

Readability. Rating scales, checklists, and self-report inventories have become popular screening instruments, used by school psychologists among others, because of their efficiency and minimal cost. Harrington and Follett (1984) argue, however, that despite improvements in the reliability and validity of these measures, there may be a basic flaw – individual respondents, whether children or adults, may have difficulty reading the instructions and the items. They analyzed the readability of 18 widely used instruments, including the Child Behavior Checklist (to be discussed) and the Personality Inventory for Children, discussed earlier. Their analyses, based on computing various reading difficulty indices, suggested that for the CBCL the average reading level required is that of an eighth grader (the author of the CBCL indicates fifth grade), while for the PIC it was that of a seventh grader. For the PIC, the authors stated that because of its length, it is “arduous if not overwhelming for a poor reader to complete.” They propose that the readability of a self-report test be considered an essential component of a well-designed test.
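Reading difficulty indices of the kind Harrington and Follett computed are simple functions of word, sentence, and syllable counts. One widely used example (not necessarily the index they used) is the Flesch-Kincaid grade level, 0.39(words per sentence) + 11.8(syllables per word) − 15.59. The sketch below relies on a crude vowel-group syllable counter, so its output is only approximate.

    import re

    def syllables(word):
        # Crude heuristic: count groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text):
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syl = sum(syllables(w) for w in words)
        return 0.39 * len(words) / sentences + 11.8 * syl / len(words) - 15.59

    print(round(flesch_kincaid_grade(
        "I feel sad. Sometimes I worry about school."), 1))   # about 5.1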
Minimizing verbal aspects. Instruments such as the Stanford-Binet or the WPPSI are well-standardized instruments, and quite useful with many types of special children. One of their problems, however, is that they tend to be highly verbal in nature, and special children often have verbal deficits. As a result, a number of tests have been developed that minimize verbal skills by using nonverbal items such as matching of patterns. Several of these tests have also been presented as “culture-fair” tests – that is, they can presumably be used with children in or from different cultures. Many of these tests, such as the Leiter International Performance Scale (Leiter, 1952) or the Coloured Progressive Matrices (J. C. Raven, Court, & J. Raven, 1977), have respectable concurrent validities, often in the .70s range with standard verbal IQ tests. However, scores on these tests are often lower than their counterparts on the Stanford-Binet or WPPSI; caution needs to be exercised if such scores are used for placement purposes. Thus a particular child may have an IQ of 76 on the Stanford-Binet but an IQ of 68 on a nonverbal test; knowing only the score on the nonverbal test might lead to erroneous decisions. That, of course, is not a criticism solely of nonverbal tests. Scores on the Stanford-Binet and the WISC-R, for example, can be quite discrepant with mentally retarded and learning-disabled children (e.g., A. S. Bloom, Reese, Altshuler, et al., 1983).
Other tests use cartoons as items and ask the child to respond by pointing to a “thermometer” response format, happy faces, or similar nonverbal stimuli (for an example, see Praver, DiGiuseppe, Pelcovitz, et al., 2000).

Testing the limits. This phrase, often associated with the Rorschach (see Chapter 15), refers to the notion of psychologically “pushing” the subject to see what the limits of that person's abilities might be. This procedure can be useful with special children, indeed with almost any subject. Once a test has been administered in a standardized manner, the examiner can return to specific items that the child missed, for example, and provide cues and encouragement for the child to complete the item. Here the intent is to see whether in fact a particular child can use cues to solve a problem, and whether encouragement and support can allow the child to really show and stretch his or her abilities. Budoff and his colleagues (e.g., Budoff & Friedman, 1964) have in fact taken testing to the limits and changed it into a strategy to assess learning potential.

Test-retest stability of preschool measures. One major concern is the low stability of measurement associated with infant behavior. When cognitive tests are administered to children as they enter school at age 6 and are readministered a year later, we typically obtain coefficients in the .70s or higher. But with infants and preschoolers, such correlations over a 1-year span range from zero to the high .50s. Interestingly, these coefficients are substantially higher for handicapped children (Kamphaus, 1993). The instability is a reflection of the behavior rather than of the measure, but it nevertheless creates problems.

Neuropsychological assessment. There are two major approaches to neuropsychological assessment in children. The first is to use a standardized battery of tasks designed to identify brain impairment. The field seems to be dominated by two such batteries: the Reitan batteries – i.e., the Halstead-Reitan Neuropsychological Test Battery for children 9 to 14 and the Reitan-Indiana Test Battery for children aged 5 through 8 (Reitan, 1969; Reitan & Davison, 1974) – and the Luria-Nebraska Children's Battery (Golden, 1981). We illustrate the Luria-Nebraska below. This approach has been criticized because of its time and cost requirements, because many of the important behaviors are evaluated only in a cursory manner, and because the procedures used tend to be redundant (e.g., Goldstein, 1984; Slomka & Tarter, 1984; Sutter & Battin, 1984).

A second approach consists of using a combination of traditional psychological and educational tests. Tests such as the K-ABC, the Stanford-Binet, and the WISC-R are often used as the main measure, together with other scales that may measure oral and written language skills (e.g., the Peabody Picture Vocabulary Test), motor and visual-motor skills (e.g., the Developmental Test of Visual Motor Integration), academic achievement (discussed in Chapter 13), and aspects of social-emotional behavior (e.g., the Child Behavior Checklist).

A third approach, of course, is to combine the two. For example, Bigler and Nussbaum (1989) describe the test battery used at a neurological clinic in Texas. The battery includes selected subtests from the Halstead-Reitan batteries, the Reitan-Aphasia Screening Battery, the Wide Range Achievement Test, the Boder Test of Reading and Spelling Patterns, the Durrell Analysis of Reading Difficulty, the Beery Test of Visual/Motor Integration, Raven's Coloured Progressive Matrices, the WISC-R, a family history questionnaire, the Child Behavior Checklist, the Personality Inventory for Children, projective drawings, and a behavioral observation inventory; additional measures, such as the K-ABC, are included as needed. For a thorough and sophisticated critique of measurement and statistical problems associated with neuropsychological assessment of children, see Reynolds (1989).

The Luria-Nebraska Children's Neuropsychological Test Battery. The Luria-Nebraska assesses brain-behavior relationships in children 8 to 12 years old. It was first published in 1980 and revised in 1987. The battery is based on the neurodevelopmental theory of A. R. Luria, a Russian physician and psychologist (e.g., Luria, 1966), and consists of 11 neuropsychological tests, with 3 additional scales developed subsequently (Sawicki, Leark, Golden, et al., 1984). Although the Luria-Nebraska is modeled on an adult version, the tasks used take into account the neurological development of children and are not merely downward extensions of adult items.

This battery consists of 149 items that assess functioning in 11 areas, such as motor skills (for example, touching fingers with the thumb in succession), rhythm skills (as in repeating a pattern of taps), visual skills (as in the recognition of objects), and intelligence, with items similar to those found on the WISC-R.
Development. The Luria-Nebraska was developed by administering the adult version to children aged 5 to 12. Initially, the authors found that children 8 years and older could do a majority of the procedures in the adult battery, and that for 13- and 14-year-olds the adult battery was quite appropriate. The authors therefore decided to create a children's battery for ages 8 to 12. This was done by eliminating difficult items from the adult battery, substituting easier items where possible, and adding new items where needed. Three versions of the battery were investigated, and the fourth version was published (Golden, 1989).

The assignment of each item to one of the 11 basic scales was done on the basis of the authors' clinical judgment, followed by a correlational analysis. A further factor analysis, on a separate sample of brain-damaged and normal children, seemed in general to substantiate item placement. However, further factor analyses of each scale alone resulted in the creation of a new set of 11 scales that, judging by their titles, have some overlap with, but are not identical to, the original 11 scales.

Each of the 11 scales is said to be multifactorial in structure. Each scale covers not just a specific skill but a domain of skills in a given area. For example, the motor scale measures fine motor speed as well as unilateral and bilateral coordination, imitation skills, verbal control of motor movements, and construction skills.

Administration. The Luria-Nebraska requires a skilled examiner and is typically used to assess children who have known or suspected brain damage. Testing time is about 2 hours. The results can be used in conjunction with other data, or as a baseline against which future evaluations can be made to assess the amount of deterioration or the effectiveness of specific therapeutic procedures such as medications or surgery. Most of the items in the battery cannot be used with children who have sensory or motor handicaps.

Scoring. Scoring of test items is based on normative data for the 8- to 12-year-old age group. Each item is scored 0 if the performance is no more than 1 SD below the mean; it is scored 1 for performance between 1 and 2 SDs below the mean, and 2 for performance more than 2 SDs below the mean. Thus higher scores indicate more severe deficit. Scoring is fairly straightforward and objective, although subjective judgment is required in many aspects of the test. The raw scores for each of the scales are transformed to T scores using the appropriate table provided in the test booklet.
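The 0-1-2 rule has a direct computational form. The sketch below illustrates the rule as just described; the actual cutoffs come from the norm tables in the test materials, and the norm values used here are invented.

    def item_score(performance, norm_mean, norm_sd):
        # Assumes higher raw performance is better.
        # 0 = within 1 SD below the mean (or better)
        # 1 = between 1 and 2 SDs below the mean
        # 2 = more than 2 SDs below the mean
        z = (performance - norm_mean) / norm_sd
        if z >= -1:
            return 0
        if z >= -2:
            return 1
        return 2

    print(item_score(42, 50, 10))   # 0 (0.8 SD below the mean)
    print(item_score(35, 50, 10))   # 1 (1.5 SDs below)
    print(item_score(25, 50, 10))   # 2 (2.5 SDs below)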
Validity. Neuropsychological assessment of children with learning disabilities is now fairly common, given that the definition of learning disabilities includes minimal brain dysfunction. So it is quite appropriate to ask questions about the validity of the Luria-Nebraska with such children. Several studies have obtained findings supportive of the validity of the Luria-Nebraska (e.g., Geary & Gilger, 1984; Nolan, Hammeke, & Barkley, 1983), but primarily with the language and academic achievement subtests, which are adequately assessed by other instruments (Hynd, 1988).

A rather interesting study is reported by Snow and Hynd (1985a). They administered the Luria-Nebraska, the WISC-R, and an achievement test to 100 children who had been previously identified as learning disabled on the basis of a discrepancy between aptitude and achievement – that is, they achieved less than what their abilities would predict. The authors analyzed the results using Q-factor analysis. In this type of factor analysis, rather than seeking out the fewest dimensions among the test variables, the Q-technique clusters subjects with similar test score patterns. The intent is to identify subgroups of subjects who “belong together” on the basis of the similarity of their test scores. The authors statistically identified three subgroups of children, which together included 72 of the 100 children. Unfortunately, the three subtypes did not appear to be markedly different on the intelligence and achievement tests, and the authors concluded that the Luria-Nebraska has a strong language component across most of its subtests, and that therefore its construct validity is poor.
found, statistically, three subgroups of children that included 72 of the 100 children. Unfortunately, the three subtypes did not appear to be markedly different on the intelligence and achievement tests, and the authors concluded that the Luria-Nebraska has a strong language component across most of its subtests, and that therefore its construct validity is poor.
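The Q-technique idea – correlate and factor the people rather than the tests – amounts to working with the transposed data matrix. A minimal sketch with synthetic data (the array shapes are illustrative, not taken from the study):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 14))  # synthetic: 100 children x 14 measures

# R-technique would correlate the 14 columns (tests).
# Q-technique correlates the 100 rows (children): np.corrcoef treats
# each row as a variable, so this yields a person-by-person matrix.
person_r = np.corrcoef(scores)
print(person_r.shape)  # (100, 100)

# Factoring (or clustering) person_r then groups children whose
# test-score profiles look alike into subtypes.
```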
The construct validity of the Luria-Nebraska has been questioned both on the basis of the relationship of the battery to Luria's theory on which it is based, and also as to its factor structure (Snow & Hynd, 1985b; Hynd, 1988).

Norms. Initially, the Luria-Nebraska was normed on 125 "normal" children, 25 at each of five age levels.

The Luria-Nebraska and decision theory. One basic way to validate a neuropsychological assessment procedure is to determine the procedure's ability to discriminate between brain-damaged and non-brain-damaged individuals. The brain-damaged individuals are usually diagnosed on the basis of a neurological exam and other evidence, while the normal subjects are usually identified by exclusion, such as not having obvious head trauma. Several studies have looked at this with quite positive results. For example, Wilkening, Golden, MacInnes, et al. (1981) studied a sample of 76 brain-damaged and 125 normal controls and found an overall accuracy hit rate of 81.6% for the Luria-Nebraska (91.3% for the normal controls and 65.3% for the brain-damaged subjects).

Geary, Jennings, Schultz, et al. (1984) studied the diagnostic accuracy and discriminant validity of the Luria-Nebraska by comparing 15 learning-disabled children with 15 academically normal children. The obtained neuropsychological profiles were rated as normal, borderline, or abnormal on the basis of cutoff scores, with the last two categories considered as presumable evidence of a learning disability. A comparison of this categorization based on test results with actual group status indicated that 28 of the 30 children were correctly identified, with two of the normal children falsely identified as learning disabled. Thus, for this study, the overall hit rate was 93.3%, the number of false positives was 13.3%, and there were no false negatives. Sensitivity is therefore 100% and specificity is 86.7% (see Chapter 3 if you've forgotten what these terms mean).
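With the study's counts in hand, these decision-theory indices can be computed directly; a short Python sketch using the Geary et al. (1984) figures as given above:

```python
# 2 x 2 classification table: 15 learning-disabled (LD), 15 normal children.
true_pos  = 15  # LD children flagged borderline/abnormal
false_neg = 0   # LD children missed by the cutoffs
true_neg  = 13  # normal children rated normal
false_pos = 2   # normal children falsely flagged as LD

total = true_pos + false_neg + true_neg + false_pos

hit_rate    = (true_pos + true_neg) / total      # 28/30 = 93.3%
sensitivity = true_pos / (true_pos + false_neg)  # 15/15 = 100%
specificity = true_neg / (true_neg + false_pos)  # 13/15 = 86.7%

print(f"{hit_rate:.1%} {sensitivity:.1%} {specificity:.1%}")
```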
Levels of interpretation. Golden (1989) points out that in interpreting the Luria-Nebraska or other batteries, there are several levels of interpretation that differ based on both the needs of the situation and the expertise of the examiner. Level 1 aims at ascertaining whether there is significant brain injury in the child, to differentiate neuropsychological from other disorders. Obviously, if it is known that the child has a significant brain injury, this question is not appropriate. At this level, the battery is used basically as a screening procedure.

Level 2 concerns the description of the child's behavior – what the child is able to do and not do. There is no interpretation or integration of the findings, but merely a description. Level 3 requires the identification of the probable causes that underlie the child's behavior. This requires a thorough knowledge of brain-behavior relationships. Level 4 involves the integration of the findings and conclusions into a description of how the brain of the subject is and is not functioning. This involves an understanding of the effects and implications of specific brain injuries.

Drawing techniques. There are a number of procedures, such as the Draw-A-Person and the House-Tree-Person, that involve having the client produce a drawing. These techniques were typically developed initially as measures of intellectual functioning, but soon became measures of personality. They are subsumed under the topic of projective techniques, discussed in Chapter 15. They have been used not only as assessment devices but also as ancillary procedures to be used in therapy, or as screening procedures to evaluate readiness for school or the effectiveness of special training programs.

Adults typically become somewhat defensive when asked to draw, but children do so often with great pleasure and little discomfort. For children, drawings are a way of portraying the world, as they see it and as they wish it were. Thus if the examiner can distinguish between what is fact and what is fiction, what is fantasy and what is fear, the drawings can become powerful sources of information. In addition, because the structure of drawing is minimal, much can be learned
about the way a child responds to authority, to ambiguity, and to their own inner resources. Finally, for the well-trained and sensitive examiner, drawings can be a source of observation and information about the child's impulsivity, self-worth, motor dexterity, and so on.

Unfortunately, from a psychometric perspective, the picture presented by drawings is quite different. Although we should be leery of dumping together a set of techniques and evaluating them wholesale, reviews of such techniques lead to the conclusion that although a clinician may be able to arrive at certain conclusions about a child's emotional disturbance and intellectual level on the basis of drawings, such tests seem to have little value in assessing personality and/or psychodynamic functioning (e.g., Cummings, 1986). They are quite useful to establish rapport, but drawing skills per se affect the performance, as well as the child's intellectual level and degree of normality.

A number of measures involve the reproduction of drawings rather than the creation of a picture. Such tasks are, of course, part of standard tests such as the Stanford-Binet, and indeed one might question whether a separate test is necessary. But these tests exist and are often used as part of a comprehensive neuropsychological battery. The Developmental Test of Visual-Motor Integration (VMI) is a good example.

The Developmental Test of Visual-Motor Integration. The VMI was originally published in 1967, revised in 1982 and 1989, and is based on the author's observation that there is a significant relationship between children's abilities to copy geometric forms and their academic achievement (Beery, 1989). The VMI is designed as a screening instrument to identify difficulties that a child may have in visual-spatial processing and/or visual-motor integration that can result in learning and behavioral problems.

The test consists of 24 drawings of geometric designs that the child copies in designated spaces. The drawings cover a wide range of difficulty, from a simple vertical line to a six-pointed star. The items are copied in order, only one attempt per drawing is allowed, and erasing is not permitted. The test is discontinued after three consecutive failures. The VMI can be administered individually or in small groups. Both the test booklet and the instructions have been carefully designed to avoid any extraneous aspects that might affect the child's performance. For example, the test booklet is not to be turned at an angle, and the child begins on the last page and works toward the front of the booklet – this latter avoids impressions on subsequent pages. The test is not timed, and typical administration takes about 10 to 15 minutes. The test is relatively easy to administer and score, and theoretically at least, it can be used with preschool children through adults, although the norms only go up to 17 years, 11 months. There is a short form of the VMI composed of the 15 easiest items, for use with children aged 2 to 8, while the full set of items is suitable for children aged 2 to 15.

Scoring. Each item is scored as passed or failed. Passed items may be scored from 1 to 4 points, with higher point values given to the more difficult designs. The manual provides clear scoring criteria, together with examples. Total raw scores are converted to standard scores, with a mean of 100 and SD of 15. These scores can also be converted to percentiles, T scores, scores with a mean of 10 and SD equal to 3, an age equivalent score, and others.
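These derived scores are all linear (or, for percentiles, normal-curve) transformations of the same underlying z score. A brief illustrative sketch (the raw-score mean and SD shown are hypothetical; the real conversions come from the VMI norm tables):

```python
from statistics import NormalDist

def derived_scores(raw, norm_mean, norm_sd):
    """Convert a raw score to the derived metrics mentioned above."""
    z = (raw - norm_mean) / norm_sd
    return {
        "standard":   100 + 15 * z,              # mean 100, SD 15
        "T":          50 + 10 * z,               # mean 50,  SD 10
        "scaled":     10 + 3 * z,                # mean 10,  SD 3
        "percentile": 100 * NormalDist().cdf(z)  # assumes a normal distribution
    }

# Hypothetical example: raw = 19 against a norm mean of 15, SD 4 -> z = +1
print(derived_scores(19, 15, 4))
# standard 115, T 60, scaled 13, percentile ~84
```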
Reliability. Interrater reliability is particularly important for this test because the scoring is ultimately based on the examiner's judgment; obtained coefficients range from .58 to .99, with a median r of about .93. Internal consistency alphas range from .76 to .91, with a median of .85, and test-retest reliabilities range from .63 (with a 7-month interval) to .92 (2-week interval), with a median of about .81. These results suggest that the reliability is adequate and that it increases with well-trained examiners. Unfortunately, the manual gives little information on the studies done to establish such reliability.

Validity. Concurrent validity of the VMI has been assessed by correlating VMI scores with a wide variety of other measures of visual-spatial and visual-motor skills, both as tests and as behavior (e.g., as in handwriting). Correlations vary quite widely, from the high .20s to the low .90s, but in general do support the concurrent validity of the VMI.

Correlations between VMI scores and standard tests of intelligence such as the WISC-R
correlate about the mid .50s, somewhat lower with verbal than with performance indices. Similarly, VMI scores correlate about .50 with school-readiness tests, as they do with measures of academic achievement such as reading; these latter correlations are somewhat higher for younger children than for older children.

Beery (1989) also reports studies of predictive validity, for example correlations of VMI scores at the beginning of kindergarten with later measures of school achievement. In general, the predictive validity of tests such as the VMI is better with younger children than older children. Unfortunately, for most studies reported in the manual there is little information given as to sample size or other details that would allow the reader to critically judge the results; many of the samples seem quite small and atypical. Whether the VMI measures the integration of the visual and the motor functions rather than either separately is debatable (W. L. Goodwin & Driscoll, 1980).

Norms. Norms are based on three samples of children tested at three different times, with the total N close to 6,000 children. Only the third sample is "representative," but the results from the three samples do not appear to be significantly different. These norms cover the ages of 4 years through 17 years 11 months, although the actual ages of the children tested exceeded these limits.

Behavior rating scales. Since the 1970s, there has been increased interest in and use of behavior rating scales as a method of assessment of children. Behavior rating scales essentially provide a standardized format in which information supplied by an informant who knows the child well is integrated and translated into some judgment (Merrell, 1994). Thus, behavior rating scales measure perceptions of specified behaviors rather than firsthand direct observation.

Behavior rating scales have a number of advantages. Merrell (1994) lists six: (1) They usually require less training and less time than direct behavioral observations; (2) They can provide information on infrequent behaviors; (3) They are more reliable than some other approaches such as unstructured interviews; (4) Because they utilize observer information, they can be used to assess individuals who may not be cooperative; (5) They reflect observations made over a period of time in a natural setting, i.e., home or school; (6) They capitalize on the judgment and observations of "experts," those who know the child best.

These scales can be particularly useful as screening instruments to identify children who might benefit from some type of intervention or who might need to be more carefully assessed with individual tests or other procedures.

They also have a number of limitations. One of the major problems has to do with interrater agreement. Low interrater agreement is often found between parents and between parent and teacher. In a way, this is not surprising because the demands and challenges of a home environment may be quite different from those of the classroom. The issue here is not whether one source is more accurate than another, but that the obtained information comes from different sources. A number of rating scales now include separate versions for the parents and for teachers to account for such differences. Higher agreement is obtained on scales that use items where a behavior is operationally defined, rather than requiring the rater to make some inference. For example, consider an item such as "gets into many fights" vs. "is an aggressive child"; the first item is one of observation, the second requires an inference.

There may be bias on the part of the informants. For one, the ratings may reflect a halo effect – e.g., because Linda is so cute she must also be bright, outgoing, and well adjusted. Or the raters may be overly critical or lenient in their perceptions, or may be "middle of the road," unwilling to endorse extreme responses. Worthen, Borg, and White (1993) point out that when a rater completes a rating scale there is a tendency for both recent and more unusual behaviors to be given greater weight.

The ratings may also reflect error variance. There are at least four sources of such variance: (1) source variance – different informants have different biases; (2) setting variance – behavior may be situation specific (Billy is horrid in math class but quite reasonable in art); (3) temporal variance – both the behavior and the informant can change from point A to point B; and (4)
instrument variance – different rating scales may measure closely related behaviors but use different items, wording, and so on.

Rating scales (or checklists) completed by a sensitive informant can yield very useful information. In fact, such scales can often provide a better picture than the child's own perception. These scales are easy to administer, can cover a substantial amount of ground, and can focus on global characteristics as well as specific behaviors. They can be more useful than an interview in assuring that all relevant areas are covered and can provide quantifiable information. Among the better known instruments may be mentioned the Behavior Problem Checklist (H. C. Quay & Peterson, 1967) and the Denver Developmental Screening Test (Frankenburg & Dodds, 1967). We look at two scales: the Child Behavior Checklist and the Conners Rating Scales (for general reviews see Barkley, 1988; Cairns & Green, 1979; Edelbrock & Rancurello, 1985; McMahon, 1984).

The Child Behavior Checklist (CBCL). The Child Behavior Checklist (CBCL; Achenbach, 1991) actually consists of six different forms: (1) the CBCL/4-18, which is for use by parents of children aged 4 to 18; (2) the CBCL/2-3, for use by parents of children aged 2 to 3; (3) the Teacher's Report Form, for use by the child's teacher, covering ages 5 to 18; (4) the Youth Self-Report, to be used by adolescents, aged 11 to 18; (5) the Direct Observation Form, to be used by an observer, after direct observation of the child in the classroom; and (6) the Semistructured Clinical Interview for Children, completed by the interviewer following an interview, and suitable for ages 6 to 11. Most of the comments to follow are applicable to the first two forms, although a full exploration of reliability, validity, and norms has been done on the first form only.

The CBCL yields five scale scores, three for social competence (activities, social, and school) and two for behavior problems (internalizing and externalizing). Internalizing refers to "overcontrol" problems while externalizing refers to "undercontrol" problems. An additional nine "narrow-band" syndrome scales are also available; these scales were developed on the basis of factor analysis for a sample of almost 4,500 protocols of children referred to clinics. A factor analysis of the nine syndrome scales yielded two "super" factors (called second-order factors) that resulted in the Internalizing and Externalizing scales.

The CBCL is designed to assess in a standardized manner the behavioral problems and social competencies of children, as reported by parents. The form can be administered by an interviewer or answered directly by a parent. The checklist consists of 118 behavior problem items (e.g., disobedient at home), each rated on a 3-point scale from not true to often true. There are an additional 20 items that cover social competencies, such as the child's involvement in sports, hobbies, jobs, and friends.

The items are clearly written, require about a fifth-grade reading level, and are nontechnical in nature. The manual is clear and well written, and the CBCL is easily administered and scored. Administration does not necessarily require a well-trained examiner, although as Freeman (1985) points out, "a checklist is only as good as the clinician who uses it." The CBCL can be hand scored, although the procedure is tedious and typically requires longer than 15 minutes. Computer scoring programs are available.

Reliability. The CBCL manual reports item reliabilities greater than .90 between mothers' reports, mothers' and fathers' reports, and reports from three different interviewers. Given that item statistics are often unstable, these are rather impressive figures. The stability of the CBCL over a 3-month period is reported as .84 for behavior problems and .97 for social competencies. Test-retest reliabilities over a 1-week interval are in the .80s to mid .90s range, but are lower for longer intervals. Interrater reliabilities between teachers and teacher aides range from .42 to .72. Interrater reliability between mothers and fathers is in the mid .60s.

Validity. Several studies support the construct validity of the CBCL, such as studies comparing "normal" children with children referred to a clinic. The discriminative power of the test is fairly high. By using the 90th percentile of the behavior-problem scores, and the 10th percentile of the social-competence scores, the authors were able to correctly classify 91.2% of the normal children and 74.1% of the referred children.
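As a rough sketch of that screening rule (a hypothetical decision function; the actual percentile cutoffs come from the CBCL norm tables, and whether the two criteria are combined disjunctively or otherwise is a detail of the scoring procedure):

```python
def flag_clinical(behavior_problem, social_competence,
                  bp_90th_percentile, sc_10th_percentile):
    """Flag a profile as clinical-range: unusually high behavior-problem
    score or unusually low social-competence score."""
    return (behavior_problem >= bp_90th_percentile
            or social_competence <= sc_10th_percentile)

# Hypothetical cutoffs and case:
print(flag_clinical(72, 38, bp_90th_percentile=70, sc_10th_percentile=35))  # True
```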
Concurrent validity results, by way of correlations with other rating scales, are also quite positive, with typical coefficients in the high .80s and low .90s. The CBCL is a very popular instrument and has been used in hundreds of studies.

Norms. The original CBCL provided norms on children between ages 4 and 16, but the 1991 revision was extended upward through age 18, and a separate form for 2- and 3-year-old children was developed. The norms are based on some 1,300 children, and T scores can be obtained for the subscales.

Evaluation. Reviewers of the CBCL see this instrument as "well-documented psychometrically with adequate reliability and validity" (B. J. Freeman, 1985) and as "one of the best standardized instruments of its kind" (M. L. Kelley, 1985), and others characterize the CBCL as one of the most sophisticated and well-researched broad-spectrum behavior rating scales (Merrell, 1994).

The Conners Rating Scales. Another set of scales that has also proven popular is the Conners Rating Scales. Conners also developed separate scales for parents and for teachers, designed to identify behavior problems in school-aged children. The CRS (Conners, 1990) consists of four scales, two parent and two teacher rating scales, that have many items in common and are also conceptually similar. The scales vary in length from 28 to 93 items. Originally developed in 1969 (Conners, 1969), they have been widely used since then and have been revised as the need arises and as research findings accumulate (Conners, Sitarenios, Parker, et al., 1998). For parents, then, there is the CPRS-93 (Conners parent's rating scale with 93 items) and the CPRS-48. For teachers, there is the CTRS-28 and the CTRS-39. The CPRS-48 and the CTRS-39 seem to be the most widely used versions. The rating scales were originally developed within an applied research setting, Johns Hopkins University Hospital, and were intended as norm-referenced instruments to be used widely. A practical result of this is that in reading the literature, one finds different versions of these scales, sometimes with different results, as in the number of factors reported.

The Conners Parent's Questionnaire, or CPRS-93 (Conners, 1970), consists of 93 items that assess a number of behavior problems ranging from hyperactivity to bowel problems. Each item is rated by the parent on a 4-point scale ranging from 0, not at all, to 3, very much (all four scales use this format). Factor analysis of this scale yielded six factors labeled as: aggressive-conduct disorder, anxious-inhibited, antisocial, enuresis-encopresis, psychosomatic, and anxious-immature. However, the actual scale is scored on five dimensions: conduct problem, learning problem, psychosomatic, impulsive-hyperactive, and anxiety.

The Conners Teacher Rating Scale, or CTRS-39 (Conners, 1969), consists of 39 items that cover three areas: classroom behavior, group participation, and attitude toward authority. It also includes six subscales such as hyperactivity, conduct problems, and daydream-attention problem. The longer version (CTRS-48) contains only five subscales, such as conduct problems, similar to those on the CTRS-39, but with different others such as psychosomatic. The teacher rates each item on a 4-point scale, identical to the one used by parents. Factor analysis suggests four clusters: conduct problem, inattentive-passive, tension-anxiety, and hyperactivity. An abbreviated 10-item scale, using the items from the longer version that are most frequently checked by teachers, is also available. These 10 items include such aspects as restless, impulsive, short attention span, easily frustrated, and has temper outbursts; the scale seems quite useful in identifying hyperactive children.

Reliability. Test-retest reliability for the CTRS-39, at 1-month intervals, ranges from .72 to .91, but drops to .33 to .55 at 1-year intervals (R. A. Glow, P. A. Glow, & Rump, 1982).

Interrater agreement on the CTRS-39 ranges from a low of .39 to a high of .94 on different subscales. Agreement between parents on the CPRS-48 averages in the low .50s. Sandberg, Wieselberg, and Shaffer (1980) reported an alpha coefficient of .92 for the 10-item Hyperactivity Index.
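As a reminder, coefficient alpha – the internal-consistency estimate cited here – is computed from the item variances and the total-score variance:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)$$

where $k$ is the number of items (10 for the Hyperactivity Index), $\sigma_i^2$ is the variance of item $i$, and $\sigma_X^2$ is the variance of the total scores.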
Validity. Numerous studies have supported the ability of the Conners scales to differentiate various diagnostic groups from their normal counterparts, such as learning disabled, hyperactive, and juvenile delinquents (e.g., Merrell, 1990).
Some evidence of their predictive validity is also available. For example, in one study behavior ratings obtained at age 7 were highly predictive of hyperactivity at age 10 (I. C. Gillberg & I. C. Gillberg, 1983). Convergent validity data are also available in a variety of studies that have correlated a Conners scale, typically the CTRS-39, with other comparable instruments and have reported significant correlations (e.g., Sandoval, 1981).

Norms. The norms for the CPRS-48 are for children aged 3 to 17, while those for the CTRS-39 are for those aged 3 to 14. Raw scores on each subscale are converted to T scores, but a total score is not computed. The answer sheet is specially designed so that scoring requires a minimum of time. Computer programs that provide administration, scoring, and interpretation of results are also available. The factor structure and normative data for the CTRS-39 are based on a sample of almost 10,000 Canadian children, presented as a stratified and random sample. For the CPRS-48, the sample is much smaller (570 children from the Pittsburgh, Pennsylvania, area).

Criticisms. Some critics feel that evaluating the reliability and validity of the Conners scales is difficult because of the many forms, and they believe that other instruments such as the CBCL are psychometrically superior (e.g., Witt, Heffer, & Pfeiffer, 1990). The response choices of "just a little," "pretty much," and "very much" have been criticized as ambiguous and ill defined – what may be "pretty much" to one person may be "just a little" or "very much" to another. Many of the scale items have also been criticized because they are too abstract (e.g., "submissive"), or contain two separate aspects (e.g., "temper outbursts, explosive and unpredictable behavior").

McCarthy Scales of Children's Abilities. Finally, let's take a look at the McCarthy Scales of Children's Abilities (D. McCarthy, 1972). The McCarthy is an individually administered test of young children's intellectual functioning. It has been used with both normal and special children, and seems to be a very useful measure, falling short of the Stanford-Binet and the Wechsler tests in popularity. Unfortunately, the author died at the time that the test was being published, so studies of the test are up to other researchers.

Description. The McCarthy Scales consist of 18 tests grouped into 6 scales: Verbal, Perceptual-Performance, Quantitative, General Cognitive, Memory, and Motor. These scales are overlapping – for example, the General Cognitive score is actually based on 15 of the 18 tests, and thus represents a global measure of intellectual development. The 18 subtests include block building, word knowledge (i.e., vocabulary), pictorial memory (recalling names of objects pictured on cards), and leg coordination (e.g., motor tasks such as standing on one foot). Although the McCarthy is standardized for children ages 2½ to 8½, it is probably most useful for children aged 3 to 6.

The McCarthy Scales use highly attractive test materials, so that the typical child finds the procedure relatively enjoyable. In fact, great care was taken to build within the testing procedure a number of steps designed to obtain a child's optimum performance. For example, several nonverbal tasks are presented before the child is asked to verbalize. When the child is asked to talk, the required responses are one-word, so the child can overcome what anxiety there might be in talking to a stranger. As another example, in the middle of the tests there are a number of activities, such as skipping, designed to give the child a break from the more scholastic type of items.

The 18 subtests are administered consecutively, most starting at a beginning level (there is no basal level) and progressing to a point where a child has made a number of errors. The administration, however, allows the examiner both to model successful performance and to complete tasks that the child cannot, to minimize anxiety and frustration.

Testing time requires somewhere between 45 and 60 minutes, in part depending on the age of the child. As with tests such as the Stanford-Binet or the K-ABC, the McCarthy requires a well-trained examiner. In fact some reviewers believe that the McCarthy Scales are more difficult to learn to administer than the WISC-R or the K-ABC (e.g., T. Keith, 1985).

Scoring. The test manual presents scoring criteria for each item and provides many examples to minimize the amount of subjectivity involved. Scores on the separate scales are normalized standard scores with a mean of 50 and a SD of 10, and scores can range from 22 to 78. The score on
the General Cognitive Scale, which is called the General Cognitive Index (GCI), is a normalized standard score with a mean of 100 and a SD of 16. Scores on the GCI can range from 50 to 150. Incidentally, the McCarthy was one of the first popular tests of intelligence that did not use the term IQ.

A profile can be drawn to summarize the six scales, and these scores can be converted to percentiles. Instructions are also provided for assessing the child's laterality (i.e., left- or right-handedness), based on the child's eye and hand preference on several subtests.

Reliability. Split-half reliabilities for the six scales range from .79 to .93, with the GCI averaging .93 across various age levels. Test-retest reliabilities over a 1-month interval range from .69 to .90, again with the GCI having one of the highest values, a not-surprising finding because it encompasses most of the subtests. In general, the reliability of the GCI and of the first four cognitive scales seems to be adequate. The results on the Motor scale, however, need to be interpreted cautiously. The reliability of the 18 subtests is not reported in the test manual.

Validity. The original test manual gives relatively little validity data, but subsequent studies in the literature have provided the needed information. The results of factor analytic studies indicate three major factors at all ages: a general cognitive factor, a memory factor, and a motor factor (e.g., A. S. Kaufman, 1975). The results also suggest that the same task may require different abilities at different ages.

No significant gender differences have been reported and few ethnic differences; socioeconomic status seems more important than race as an influence on performance on this test (A. S. Kaufman & N. L. Kaufman, 1975). The McCarthy correlates moderately to strongly with tests of achievement and cognitive skills, in both concurrent and predictive validity studies (e.g., Bracken, 1981; A. S. Kaufman, 1982; Nagle, 1979).

Norms. The original standardization sample consisted of 1,032 children between the ages of 2½ and 8½. At each age level (½-year steps below 5½ and 1-year steps above), there were approximately 100 children, with an equal number of boys and girls, selected in accord with U.S. Census characteristics such as geographic region, race, urban vs. rural residence, and so on. These children were "normal" children – children who were institutionalized or had obvious physical handicaps were excluded.

Short forms. A. S. Kaufman (1977) proposed a short form of the McCarthy for rapid screening of preschool, kindergarten, and first-grade children; Taylor, Slocumb, and O'Neill (1979) have proposed another short form by identifying the six subtests that correlated most highly with the GCI, in a sample of 50 kindergarten children. For a comparison of three short forms see Harrington and Jennings (1986). A set of six subtests has also been published as a "separate" test called the McCarthy Screening Test (McCarthy, 1978).
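The Taylor, Slocumb, and O'Neill strategy – retain the subtests that correlate most highly with the overall index – can be sketched in a few lines (illustrative arrays; the original study used 50 kindergarten protocols and the 18 McCarthy subtests):

```python
import numpy as np

def pick_short_form(subtests, gci, k=6):
    """Rank subtests by their correlation with the GCI; keep the top k.

    subtests: (n_children, n_subtests) score matrix
    gci:      (n_children,) General Cognitive Index scores
    """
    r = np.array([np.corrcoef(subtests[:, j], gci)[0, 1]
                  for j in range(subtests.shape[1])])
    return np.argsort(r)[::-1][:k]  # indices of the k best subtests
```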
Interesting aspects. One interesting aspect of the McCarthy is that the various subtests can be clustered into different areas of intellectual functioning than those originally proposed; for example, one such area might be visual-organizational abilities. Age-equivalent scores can be calculated for these areas (A. S. Kaufman & N. L. Kaufman, 1977). In addition to the test manual, much information about the McCarthy is available in journal articles and textbooks (one excellent source is A. S. Kaufman & N. L. Kaufman, 1977). For a review of the McCarthy Scales, see Bracken (1991).

SUMMARY

The needs of special children present challenges and require instruments over and beyond those discussed in Chapter 9 with nonhandicapped children. The testing of these special children requires a great deal of innovative thinking and flexibility on the part of all concerned. Yet such innovativeness can in some ways go counter to the basic psychometric canons. What we have looked at in this chapter is a wide variety of issues and instruments representing, in some aspects, clear and innovative solutions and, in others, inadequate attempts that are far from acceptable.

SUGGESTED READINGS

Czeschlik, T. (1992). The Middle Childhood Temperament Questionnaire: Factor structure in a German
sample. Personality and Individual Differences, 13, 205–210.

In this study, the cross-cultural validity of a temperament questionnaire was assessed through factor analysis. The results do not support the validity of this instrument. The author concludes that if temperament research is to make progress, there is need for sound measuring instruments.

Glascoe, F. P., & Byrne, K. E. (1993). The accuracy of three developmental screening tests. Journal of Early Intervention, 17, 368–379.

The authors indicate that developmental screening tests are widely used for early identification, but few studies look at the percentage of children with and without problems that are correctly detected, i.e., the "hit" rate. In this article, the authors assess three such screening tests.

Scarr, S. (1981). Testing for children. American Psychologist, 36, 1159–1166.

Scarr argues that in addition to cognitive functioning, children should be assessed as to motivation and adjustment, since these are important components of intellectual competence. Above all, she points out that testing should always be used in the interests of the children tested.

Waksman, S. A. (1985). The development and psychometric properties of a rating scale for children's social skills. Journal of Psychoeducational Assessment, 3, 111–121.

A readable article, illustrating the development of a rating scale to assess children's social skills. You might want to compare and contrast this article with the one by Clark, Gresham, and Elliott (1985), who also developed a children's social-skills measure.

Witt, J. C., & Martens, B. K. (1984). Adaptive behavior: Tests and assessment issues. School Psychology Review, 13, 478–484.

As the authors indicate, the assessment of adaptive behavior was rare prior to the mid 1960s, but today it is almost routine. The authors review the reasons for this change, discuss various definitions of adaptive behavior, and point out that half of the available tests of this construct lack the most rudimentary psychometric data.

DISCUSSION QUESTIONS

1. How do we know that a test is valid when used with a "special" child?

2. How would you describe the Vineland Adaptive Behavior Scale to someone with little testing background?

3. How were the items for the Peabody Picture Vocabulary Test originally selected? Can you think of other ways that might be better?

4. The Hiskey-Nebraska is characterized as "psychometrically inadequate." How might such inadequacy be remedied?

5. Of the several categories of special children discussed in this chapter, which might be the most challenging to test?
10 Older Persons
AIM This chapter looks at testing older persons. We first discuss some basic issues such as defining who is an older person, practical issues of testing, and some general comments related to personality, cognitive, and attitude testing. Then we look at a number of specific areas of relevance: attitudes toward older persons, anxiety about aging, life satisfaction, marital satisfaction, morale, coping, death and dying, neuropsychological assessment, memory, and depression.

SOME OVERALL ISSUES

Currently, there is great interest in older persons, in part because the number of older persons and their relative frequency within the general population have increased substantially (John, Cavanaugh, Krauss-Whitbourne, 1999). Exactly who is an "older person"? Chronological age is often used as the criterion, with age 65 as the cutoff point. Terms such as the young-old (ages 65 to 74) vs. the old-old (75 years and older) have been proposed (Neugarten, 1974), as related to various physical and sociopsychological characteristics. The term aging, as J. E. Birren and B. A. Birren (1990) so well state, implies something that is associated with chronological age but not identical with it. Thus the term is used in two ways – as an independent variable used to explain other phenomena (e.g., there are changes that occur in intellectual processes as one gets older) and as a dependent variable explained by other processes (e.g., lack of support by others creates aging difficulties). In contrast to chronological age, functional age has been suggested (Salthouse, 1986), i.e., the person's ability to be involved in directed activities, intellectual pursuits, and so on.

The fact that the population of the United States is aging presents a number of challenges directly relevant to psychological testing. For one, we know relatively little about the "oldest-old," those over age 80; information about how, for example, mental abilities change might prove useful in terms of the maintenance of independent functioning vs. more effective long-term care.

Older people, when they are tested by psychologists, often present a complaint that may have wider ramifications – for example, "difficulty on the job" may be related to marital difficulties, medical problems, lowered self-esteem, anxiety about personal competence, and so on. Many of the applications of testing to the elderly are problem- or diagnostic-oriented. Thus, for example, there are hundreds of studies on Alzheimer's in the elderly, but very few studies on creativity in the elderly.

At the same time, we should remember that many older people do experience difficulties that may affect their test performance. For example, it is not unusual for an older person to experience visual difficulties that may range from quite severe to simply annoying, such as taking a bit longer to adapt to a change in illumination. These difficulties may affect their performance on particular tests, especially when some visual component of the test, such as print size, is important.
Testing problems. D. Gallagher, Thompson, and Levy (1980) point out that among the major problems involving the use of tests with older persons are: improper standardization, lack of normative data, poor reliability and external validity, ambiguous instructions, inappropriate items, and inability of tests to discriminate at lower levels of functioning. They also point out that traditionally, clinical psychologists have been trained to evaluate intellectual processes, personality, psychopathology, and other relevant areas, but that in testing older persons there is need to focus on such aspects as physical health, leisure-time use, and life satisfaction.

Equivalence of tests. Many measures used with older persons were originally developed using younger samples, and so a basic question is: Are such tests equivalent when used with elders? By equivalent we mean that they should be reliable and valid when used with older persons and, to a lesser degree, whether such aspects as item difficulty and factor structure are also the same. A different factor structure in middle-aged samples and in elderly samples would of course not necessarily invalidate a test. Keep in mind that a factor structure is basically a statistical "fiction" we impose on reality to attempt to explain that reality. Different factor structures in different samples may in fact reflect important aspects of reality.

Most tests now used with older persons show equivalence in reliability and validity. Early studies of the equivalence of factor structure were contradictory, but recent studies with more sophisticated techniques of confirmatory factor analysis seem to suggest equivalence in factor structure as well (Hertzog & Schear, 1989).

Practical issues. In Chapter 9, a number of practical issues were discussed with regard to testing children. Many of these issues apply to older persons as well. The test environment, for example, needs to have adequate lighting. Many elderly individuals have hearing impairments, so directions need to be clearly audible, and the testing room needs to be free of those acoustical problems that can affect the understanding of instructions. The test format must be suitable for the physical limitations the clients may have – printed items need to be in print large enough to be read comfortably, and oral administration must be clearly heard. The test must have face validity in that the client must feel that the test is appropriate and useful.

Good rapport is also quite important. Many elderly individuals may not be "test wise," and may find multiple-choice answer sheets and similar forms confusing; at the same time, they may not be willing to admit their quandary to a younger examiner. Often, older individuals are tested clinically because of the possible presence of some problem or condition – dementia, for example – and the individual needs to be reassured so that anxiety about the testing and the potential findings does not interfere with the performance, and so that self-esteem is not lowered by what may be perceived as failure or incompetence (Crook, 1979). Ideally, the testing situation should be nonthreatening, and the client should leave feeling a sense of accomplishment and enhanced self-worth.

Although many people, particularly younger adults such as college students, are relatively comfortable taking tests, the elderly can be intimidated by them, and they may act in highly cautious ways that might negatively affect their performance.

Fatigue is also a problem. Often older persons are assessed with a multivariate test such as the WAIS, or an entire test battery composed of many such tests, and the examiner needs to be sensitive to this aspect. If fatigue takes place, the client may perceive the testing procedure as a failure on his or her part and may feel depressed and/or inadequate. Tests for the older person need to be brief, not only because of fatigue but also because of a briefer attention span (Wolk, 1972).

Aiken (1980) suggested eight procedures to be implemented when testing older persons: (1) Give ample time for the client to respond; (2) Give practice items; (3) Test in several short sessions rather than a few long ones; (4) Recognize and be sensitive to fatigue on the part of the client; (5) Be aware of and make appropriate accommodations for any sensory deficits (such as hearing) the client may have; (6) Make sure the testing environment is free of distractions; (7) Give lots of encouragement; and (8) Do not pressure the client to continue if the client refuses.

Self-assessment vs. performance-assessment. Suppose that we wanted to assess someone's ability to swim. We could ask them a relevant set
of questions, either in interview form or as a rating scale, e.g., Can you swim? How good a swimmer are you? Are you certified by the Red Cross? and so on. This type of assessment is called self-assessment, and recently a number of writers have questioned the validity of such methods. We could also assess the person's ability to swim by having them swim. This is called performance assessment, also variously called "direct," "objective," or "behavioral" assessment (E. L. Baker, O'Neill, & Linn, 1993).

At first glance, such performance assessment would seem to be superior to self-assessment. A. M. Myers, Holliday, Harvey, et al. (1993) compared the two methods in a sample of adults aged 60 to 92. They asked these adults to complete a series of tasks that included measures of motor capacity (e.g., using a dynamometer, moving limbs, etc.), of manual ability (e.g., spooning beans into a can), self-care activities (e.g., using the telephone, writing a check), and other activities, such as opening a medication container, following a simple recipe, and picking up a penny from the floor. These tasks were all administered in a standardized form. They also asked the participants to complete a self-assessment questionnaire that addressed the same activities. They found that although 182 subjects were willing and able to complete the self-assessment questionnaire, only 99 attempted at least one of the performance tasks. These authors reported that the performance measures were not more acceptable to the participants than the self-report measure, that some participants found some of the tasks silly or demeaning, and that some found the procedure of being observed as they undertook a task rather disruptive. Based on this and a variety of other analyses, the authors concluded that functional performance measures do provide different information but should not be viewed as psychometrically superior.

Tests of personality. In Chapter 4, we discussed tests of personality, and much of what was discussed there is applicable to the testing of older persons. Most tests of personality, such as the MMPI, the CPI, and the Edwards Personal Preference Schedule, are applicable to the elderly, although in some cases (such as the EPPS) there is the question of the appropriateness of the available norms. There are also changes that do occur as a function of aging and concomitant aspects. For example, a number of studies have been done on the MMPI with older persons, with the general conclusion that the scales measuring somatic complaints (scale 1), depression (scale 2), denial (scale 3), and social introversion (scale 0) are generally higher in the aged, and scales measuring rebelliousness (scale 4) and energy level (scale 9) are generally lower (Gynther, 1979; Swenson, 1985).

Cognitive functioning. In Chapter 5, we discussed tests of cognition. Standard tests of adult intelligence, such as the WAIS, are certainly applicable to older adults. The challenge is not so much in the test itself but in the available norms. Most tests of cognitive functioning are normed on younger adults, and the norms may not be appropriate for older adults. Second, comparisons of adult groups of various ages to assess, for example, intellectual decline with advancing age often compare groups that are different in more ways than just age. In the United States, for example, older groups tend to be less educated and more likely to be of immigrant background than younger groups. It is also important to remember that for many categories of tests, such as tests of cognitive functioning and rating scales, reliability and validity are closely dependent on the skills of the examiner who uses the scale (Overall & Magee, 1992).

Values and attitudes. In Chapter 6, we discussed the measurement of values and attitudes, and much of our discussion is applicable to older persons. A typical example of the type of study that has been done with elders is that by Kogan and Wallach (1961), who used the semantic differential to assess age changes in attitudes and values. They compared the responses of 137 college students vs. those of 131 older men and women, members of a gerontological research association, with a mean age of about 70 to 71. The two samples were similar in education and verbal-intelligence level. The semantic differential used in this study consisted of 25 bipolar pairs of adjectives; the participants rated 28 different concepts representing work and leisure, majority and minority groups, family and interpersonal relations, self-concept, and other areas. A factor analysis yielded a strong evaluative factor (as one would expect). Approximately one third of the concepts yielded significant age differences, with
older subjects rating concepts like "retirement" and "old age" more favorably and concepts such as "future" and "life" less favorably.

Another concept that we discussed in Chapter 6 was that of Guttman scales. This type of scaling has been used with the elderly, particularly in the context of assessing "activities of daily living" or degree of disability present. For example, Katz, Ford, Moskowitz, et al. (1963) developed a hierarchy of activities of daily living with seven levels:

1. those without any disability;
2. those with one disability;
3. those with two disabilities who have difficulty bathing;
4. those with three disabilities, including difficulty in bathing and dressing themselves;
5. those with four disabilities, including difficulty in bathing, dressing, and toileting;
6. those with five disabilities, including difficulty in bathing, dressing, toileting, and transferring from bed;
7. those who have difficulty performing all activities of daily living.

This scale has been useful in studies of older persons (e.g., Siu, Reuben, & Hayes, 1990), although some authors question the invariance of such a progression (e.g., Lazaridis, Rudberg, Furner, et al., 1994). Another area where Guttman scales have been useful with the elderly is in the study of morale (see Kutner's seven-item morale scale in Kutner, Fanshel, Togo, et al., 1956).
in Kutner, Fanshel, Togo, et al., 1956). subset of 30 items readministered with the final
examination; the obtained r was .96. Spearman-
Brown reliability coefficients from .73 to .88 are
Although this scale has been used in one form or another in a variety of studies, the studies were typically "isolated" and do not present the cohesive portrait required by construct validity. In part, the difficulty may lie in the lack of a well-articulated theory about attitudes toward older persons. Axelrod and Eisdorfer (1961) administered the scale to a class of college students, with random fifths of the class asked to respond to different age groups (35, 45, 55, 65, and 75). If we assume that the negative stereotype of aging increases with the age of the stimulus target, then the sensitivity of the scale to such increases could be seen as evidence of the construct validity of the scale. In fact, 96 of the 137 items showed such monotonic increases in percentage endorsement.
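That monotonicity check is easy to express; a small sketch (the endorsement percentages shown are hypothetical):

```python
def endorsement_is_monotonic(pct_by_target_age):
    """True if percentage endorsement never decreases across the
    stimulus targets aged 35, 45, 55, 65, and 75."""
    return all(a <= b for a, b in zip(pct_by_target_age, pct_by_target_age[1:]))

print(endorsement_is_monotonic([12, 18, 25, 31, 40]))  # True
print(endorsement_is_monotonic([12, 30, 25, 31, 40]))  # False
```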
Table 10–1. Multidimensional Model of Anxiety About Aging (Lasher & Faulkender, 1993)

Fears: of aging (the process of aging) / of being old (the state of being old) / of old people (the perception of others)

Dimensions of anxiety:
1. Physical – e.g., perceived changes in physical appearance as one gets older; worries about health
2. Psychological – e.g., self-esteem and life satisfaction; degree of personal control; fear of memory loss
3. Social – e.g., economic and social losses; worries about retirement
4. Transpersonal – e.g., search for meaning of life; religious issues

ANXIETY ABOUT AGING

Lasher and Faulkender (1993) proposed that anxiety about aging is a separate dimension from other forms of anxiety such as death anxiety or state-trait anxiety. They felt that this concept of "aging anxiety" has importance both because it helps us to understand how we react to the elderly, and because it has not been adequately researched.

These authors began the development of their scale with a theoretical model composed of two major dimensions, as illustrated in Table 10–1. The intersection of the two dimensions yields 12 cells, and the authors used this theoretical blueprint to generate 7 items per cell, for a total of 84 items. These items used a Likert 5-point response scale, with half of the items phrased positively and half of the items phrased negatively. The authors then asked three psychology graduate students to sort the items in terms of the two dimensions. This was done twice, and the authors concluded that no new items were needed, although no statistical data are offered in support of this conclusion. Note that this represents a variation from the usual procedure, where a pool of items is first developed and then a subset of the "best" items is chosen on the basis of some preliminary decision rules.
The 84-item Aging Anxiety Scale (AAS) was that whether the results are positive or negative
then administered to 312 volunteers, ranging in tends to be a function of how the variables are
age from below 25 to over 74. A series of fac- defined. For example, A. Campbell, Converse,
tor analyses were performed, and finally 20 items and Rodgers (1976) found that younger people
were retained, reflecting 4 factors, each com- reported feeling happier than older persons, but
posed of 5 items: fear of old people; psychological reported lower life satisfaction.
concerns; physical appearance; and fear of losses A number of investigators have attempted to
(it is interesting to note that only 6 of the 20 items define and measure the psychological well-being
are worded negatively and that 5 of these 6 occur of older people, quite often with the intent of
on the fear-of-losses factor). using such a measure as an operational definition
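Agreement indices of this sort are straightforward to compute. Below is a minimal sketch, using invented ratings from two hypothetical raters on a 5-point scale, of how exact agreement and agreement within one step might be tallied:

    # Exact and within-one-step agreement for two raters using a 5-point
    # scale. The ratings are invented for illustration.

    rater_a = [3, 4, 2, 5, 1, 3, 4, 2, 5, 3]
    rater_b = [3, 5, 2, 4, 1, 3, 3, 2, 5, 4]

    n = len(rater_a)
    exact = sum(a == b for a, b in zip(rater_a, rater_b))
    within_one = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b))

    print("exact agreement:      {:.0%}".format(exact / n))
    print("agreement within one: {:.0%}".format(within_one / n))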
Because the above scale is based on interview material, its use would be time consuming and limited to those occasions where a participant is, in fact, interviewed. Interviews can of course vary in length, topics covered, and so on, so that the reliability and validity of the scale is intimately bound with the interview procedure. The authors therefore developed two self-report instruments by selecting 60 cases and analyzing the interview materials of those who were high scorers and those who were low scorers on the initial life satisfaction measure. They also added items as needed, and the results were two scales: (1) the Life Satisfaction Index A, consisting of 25 attitude items for which an agree or disagree response is required (e.g., These are the best years of my life); and (2) the Life Satisfaction Index B, consisting of 17 open-ended questions and checklist items to be scored on a 3-point scale [e.g., As you get older, would you say things seem to be better or worse than you thought they would be? – better (2 points); about as expected (1 point); worse (0 points)]. These scales were administered to 92 respondents along with an interview. The scales were then revised with the result that scores on scale A correlated .55 with the life satisfaction ratings, while scores on scale B correlated .58 with the life satisfaction ratings. Correlations of the scale scores with a clinician's ratings based on interview were .39 and .47, respectively. (For a critique of the Life Satisfaction Index, see Hoyt & Creech, 1983.)
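As an illustration of the two response formats, here is a minimal scoring sketch; the keys, responses, and point values are invented, not the actual Life Satisfaction Index items:

    # Scoring sketch for the two formats: keyed agree/disagree items
    # (as on Index A) and 3-point items (as on Index B). The keys,
    # responses, and point values are invented.

    # Index A style: one point when the response matches the keyed direction.
    key       = ["agree", "disagree", "agree", "agree"]
    responses = ["agree", "agree", "agree", "disagree"]
    score_a = sum(r == k for r, k in zip(responses, key))

    # Index B style: responses are coded directly as points
    # (better = 2, about as expected = 1, worse = 0) and summed.
    points_b = [2, 1, 0, 2, 1]
    score_b = sum(points_b)

    print("Index A score:", score_a, "of", len(key))
    print("Index B score:", score_b, "of", 2 * len(points_b))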
Lohmann (1977) administered seven of the most frequently used measures of life satisfaction, adjustment, and morale, including the two scales discussed above and the Lawton scale to be discussed below, to a sample of 259 adults over the age of 60. All scales correlated significantly with each other, with coefficients ranging from a low of .24 to a high of .99 and a median correlation of .64. The two Neugarten scales correlated .63 with each other and .76 and .74 with the Lawton scale.

Doyle and Forehand (1984) studied survey data that had been collected in 1974 with a nationally representative sample of noninstitutionalized Americans aged 18 and older. The survey was designed specifically to facilitate comparisons between three age groups: those aged 18 to 54, 55 to 64, and 65 and older. As part of the survey, respondents were administered a version of the Neugarten Life Satisfaction Index discussed above. They found that life satisfaction decreased with advanced age, although the decline was very apparent only in those who were in their 70s and 80s. Lowered life satisfaction was a function of poor health, loneliness, and money problems. Thus age correlates with life satisfaction, but it is not age per se, but rather some of the concomitants, such as lesser income and greater health problems, that create such a relationship.

There is quite a lot of research on the concept of self-perceived quality of life among older persons, but the psychometric adequacy of the measures themselves has not been studied extensively. Illustrative concerns can be found in Carstensen and Cone (1983), who administered the Neugarten Life Satisfaction Index and the Philadelphia Geriatric Center Morale Scale (see below) to a sample of 60 persons, aged 66 to 86. The two scales correlated .64 with each other. Both scales also correlated significantly with a measure of the tendency to represent oneself in a socially favorable light (discussed in Chapter 16). The authors wondered if life satisfaction really decreases with age, or if people are merely willing to endorse less desirable item content as they get older.

MARITAL SATISFACTION

The life satisfaction of older people is substantially related to their marital relationships (e.g., Dorfman & Mofett, 1987; Medley, 1980). S. N. Hayes and his colleagues (1992) felt that the marital satisfaction inventories available were not fully appropriate for older individuals, and therefore carried out a series of five studies to develop such a questionnaire, which they called the MSQFOP (Marital Satisfaction Questionnaire for Older Persons).

Study 1. The first step was to generate a pool of items based on the available literature, on other available questionnaires, and on structured interviews with a small sample of older individuals and professional workers. This procedure yielded an initial pool of 120 items that were then reviewed to remove redundancies, etc., and resulted in a preliminary version of 52 items answered on a 6-point response scale from very dissatisfied to very satisfied. Note that the even number of responses was chosen purposely to minimize a "central response" tendency. The 52-item version was then administered to 110 older married persons, whose mean age was 69.9 years. An item analysis was then undertaken, and items were eliminated if less than 5% of the sample indicated dissatisfaction (a rating of 3 or less) on that item, or if the correlation between the item and the total (i.e., internal consistency) was less than .65. These procedures resulted in a 24-item scale that can be completed in 6 to 8 minutes, with 20 items that address specific areas of marital distress (e.g., the day-to-day support my spouse provides: very dissatisfied to very satisfied). These 20 items, when summed, generate a marital-satisfaction scale score.

Study 2. Here the MSQFOP was administered to 40 married persons (mean age of 63 years), who were then retested some 12 to 16 days later. Test-retest correlations for the individual items ranged from .70 to .93, and the correlation for the total score was .84.

Study 3. The MSQFOP was administered to a sample of 56 persons (mean age of 63.5 years) along with the Locke-Wallace Marital Adjustment Test (Locke & Wallace, 1959), which is a frequently used marital satisfaction inventory. The total scores on the two inventories correlated .82.

Study 4. The aim of this study was to develop norms for the MSQFOP and to examine its factor structure, homogeneity, and construct validity. The MSQFOP and several other measures were administered to a sample of 313 married persons, with a mean age of 66 years. A factor analysis suggested one major factor, composed of 16 of the 20 items, that accounted for 58% of the variance, and two smaller factors accounting for 6% and 5% of the variance; these factors were labeled as communication/companionship, sex/affection, and health. As the authors point out, given that the items were initially retained based on their item-total correlation, it is not surprising that there is one major factor.

Homogeneity was assessed by computing the Cronbach coefficient alpha – the obtained value was .96 for men separately, and also for women. Thus the 20-item scale seems to be quite homogeneous for both genders, in line with the results of the factor analysis. A number of gender differences were however obtained, with men scoring higher than women on the total score and on the communication/companionship factor. Finally, correlations between the MSQFOP and other related measures were almost all statistically significant, and in many cases they were substantial and supportive of the construct validity of this measure.
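The item-retention rules of Study 1 and the homogeneity index of Study 4 can both be illustrated with a short sketch. This is not the authors' actual analysis: the data matrix below is invented, and the item-total correlation here excludes the item itself from the total (one common choice; whether the original authors did so is not stated). Alpha is computed with the standard formula.

    # Sketch: item-retention checks and Cronbach's alpha on an invented
    # matrix of ratings (rows = respondents, columns = items) using the
    # 6-point very dissatisfied (1) to very satisfied (6) scale.

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / (sxx * syy) ** 0.5

    def variance(x):
        m = sum(x) / len(x)
        return sum((a - m) ** 2 for a in x) / (len(x) - 1)

    data = [  # invented ratings, 6 respondents x 4 items
        [5, 6, 5, 6],
        [4, 4, 5, 4],
        [6, 6, 6, 5],
        [2, 3, 2, 3],
        [5, 5, 6, 6],
        [3, 4, 3, 4],
    ]

    n_items = len(data[0])
    totals = [sum(row) for row in data]

    for j in range(n_items):
        item = [row[j] for row in data]
        pct_dissatisfied = 100 * sum(v <= 3 for v in item) / len(item)
        rest = [t - i for t, i in zip(totals, item)]  # total minus the item
        r_item_total = pearson(item, rest)
        keep = pct_dissatisfied >= 5 and r_item_total >= .65
        print("item", j + 1, "% dissatisfied:", round(pct_dissatisfied),
              " r:", round(r_item_total, 2), " keep:", keep)

    # Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / total variance)
    item_vars = [variance([row[j] for row in data]) for j in range(n_items)]
    alpha = n_items / (n_items - 1) * (1 - sum(item_vars) / variance(totals))
    print("alpha =", round(alpha, 2))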
Study 5. In this study, 26 couples, with a mean age of 65, were videotaped as they discussed an important marital problem. These tapes were then rated as to overall positiveness vs. negativity of the spouses' actions toward the partner. Scores on the MSQFOP correlated significantly with such ratings, somewhat higher for men than for women. Although the authors presented this as evidence for predictive validity, the results are probably best understood in the context of construct validity.

MORALE

Sauer and Warland (1982) criticized instruments in the area of morale as lacking conceptual clarity, i.e., the measured concept is not well defined and the items generated are therefore not tied to a specific definition. Because of this, instruments in this area have little in common with each other. They also often lack adequate reliability, both at the initial steps, when the author should produce such evidence, and at subsequent stages, when users should also generate such information. Validity information is also often lacking. We briefly look at two scales in this area that are somewhat better than most others and have found widespread use in a variety of studies.

The Philadelphia Geriatric Center Morale Scale (PGC Morale Scale; Lawton, 1972). Lawton believes that morale is a multidimensional concept composed of a basic sense of satisfaction with oneself, a feeling that there is a place in the environment for oneself, and an acceptance of what cannot be changed. Originally, 50 items were written or taken from existing scales to represent the content areas thought to be related to morale. Several revisions of the content of the items resulted in a 41-item scale. These items were administered in small group sessions to 208 tenants of an apartment dwelling for the independently aged, whose average age was 77.9, and to 92 residents of a home for the aged, whose mean age was 78.8. As a criterion for morale, a psychologist and a nurse familiar with the patients in the first sample were provided with a detailed definition of morale; they were asked to rank order the 107 subjects they were familiar with into 8 groupings, according to the degree of judged morale. The two observers agreed .45 in their rankings (after "consultation" this was raised to .68). A similar procedure was used with the second sample. The 41 items of the scale were then correlated with these rankings and, on the basis of various statistical analyses, 22 items with a yes-no response format were retained. Morris and Sherwood (1975) revised the scale to 15 items, while Lawton (1975) revised the scale to 17 items.

The scale is purposely short so as not to fatigue the respondent. It can be administered in written or oral form, individually or in groups. Examples of items are: "I have as much pep as I did last year," and "Life is hard for me most of the time." The original sample consisted of some 300 residential clients, mostly female, with an average age of 78.2 years. Lawton's (1975) revision involved a sample of more than 1,000 residents, while Morris and Sherwood (1975) assessed almost 700 elderly persons.

Lawton (1972) factor analyzed the 22 items and came up with 6 factors: (1) surgency (i.e., activity and freedom from anxiety and depression); (2) attitude toward own aging; (3) acceptance of status quo; (4) agitation; (5) easygoing optimism; and (6) lonely dissatisfaction. These six factors intercorrelate moderately with each other, with most coefficients in the .30s, and ranging from .16 to .52. Morris and Sherwood (1975) were only able to replicate factors 1, 2, and 6 and thus suggested dropping five of the items. The revised 17 items were then factor analyzed and two factors obtained: (1) tranquillity, and (2) satisfaction with life progression. Lawton (1975) carried out several factor analyses and felt that a three-factor solution was the best: (1) agitation, (2) attitude toward own aging, and (3) lonely dissatisfaction.

The reliability of the scale is somewhat marginal. For the 22-item scale, split-half reliability is reported to be .79, and K-R reliability .81. Incidentally, the split-half reliability was not computed by the usual method of odd-vs.-even items, but by dividing the scale into two subsets of items matched in content. Test-retest coefficients, with intervals ranging from 1 week to 3 months, varied from a low of .22 for the surgency factor to a high of .89 for the attitude-toward-own-aging factor, with only one of six coefficients above .70 in one sample, and four of the six above .70 in a second sample. For the 17-item revision, all three factors show Cronbach's alphas of .81 to .85. For the 15-item revision, the tranquillity factor shows K-R 20 coefficients of .73 and .78, and the satisfaction-with-life-progression factor shows coefficients of .58 and .65. Part of the problem, as you might guess, is that the scale is brief, and the factors are even briefer, ranging from two to five items each.
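For readers who want the mechanics, here is a sketch of a split-half reliability computation with the Spearman-Brown correction, which adjusts the half-test correlation to full test length. The yes-no responses and the "matched halves" assignment below are invented; they are not the PGC items:

    # Split-half reliability with the Spearman-Brown correction, on
    # invented yes (1) / no (0) responses; rows = respondents, 6 items.

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / (sxx * syy) ** 0.5

    responses = [
        [1, 1, 0, 1, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0, 0],
        [1, 0, 1, 1, 1, 1],
        [0, 1, 1, 0, 0, 1],
    ]

    half_a_items = [0, 2, 4]  # one half of the items, matched on content
    half_b_items = [1, 3, 5]  # the other half

    half_a = [sum(row[i] for i in half_a_items) for row in responses]
    half_b = [sum(row[i] for i in half_b_items) for row in responses]

    r_half = pearson(half_a, half_b)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown, corrected to full length
    print("half-test r =", round(r_half, 2),
          " Spearman-Brown estimate =", round(r_full, 2))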
Validity is also problematic, and some critics (e.g., Sauer & Warland, 1982) question whether the scale adequately measures the domain of morale and feel that additional work, both theoretical and empirical, is needed. Among the validity data available, we might mention correlations of .43 and .53 of total scores with Q-sort evaluations of morale by judges familiar with the subjects, and a correlation of .57 with another measure of morale (Lawton, 1972). Lohmann (1977) correlated the PGC Morale Scale with nine other measures of psychological well-being and obtained correlations ranging from .47 to .79. Clearly, morale and psychological well-being are related concepts, but are they identical?

COPING OR ADAPTATION

Coping or adaptation basically involves the efforts of an individual in solving real-life problems (E. Kahana, Fairchild, & B. Kahana, 1982). Often the focus is on problems that represent everyday life stresses or major life crises. A good example of an instrument in this area is the Geriatric Scale of Recent Life Events (E. Kahana, Fairchild, & B. Kahana, 1982). Of the 55 items on this scale, 23 were taken directly from the Holmes and Rahe (1967) Social Readjustment Rating Scale (see Chapter 15), and 8 more items were altered in various ways. Additional items particularly relevant to older persons were then added. The scale is thus composed of items such as "minor illness," "death of a close friend," "change in residence," "retirement," and "marriage of a grandchild." Respondents are asked to indicate whether the event has occurred in their lives, and the degree of readjustment or change required by a given event, on a 0 to 100 scale. The scale can be administered as a questionnaire or as an interview. The authors suggest that with older subjects the interview format is preferable.

The initial normative sample consisted of 248 individuals aged 60 years or older, with a mean age of 70.8 years. To score the questionnaire and obtain a total "stress" score, the "stress weights" for items (events) that are checked are simply summed. For example, a "minor illness" has a weight of 27 (one of the lowest weights), a "financial difficulty" has a weight of 59, and "death of a spouse" has a weight of 79 (one of the highest weights). E. Kahana, Fairchild, and B. Kahana (1982) report correlations of .51 to .84 between the stress weights obtained in their study and those originally reported by Holmes and Rahe (1967). They label this "reliability," but it can be argued whether this is in fact evidence for the reliability of this scale.
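The scoring procedure itself is a simple weighted checklist. A sketch follows; the three weights mentioned in the text are used as given, while the remaining weights and the respondent's checked events are invented for illustration:

    # Weighted life-events "stress" score. Three weights come from the
    # text; the others, and the checked events, are invented.

    stress_weights = {
        "minor illness": 27,
        "financial difficulty": 59,
        "death of a spouse": 79,
        "change in residence": 40,       # invented weight
        "retirement": 52,                # invented weight
        "marriage of a grandchild": 30,  # invented weight
    }

    # Events this respondent reports having experienced:
    checked = ["minor illness", "change in residence", "retirement"]

    total_stress = sum(stress_weights[event] for event in checked)
    print("total stress score =", total_stress)  # 27 + 40 + 52 = 119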
The use of such life events is a popular one, with specific scales developed for various target groups such as college students and the elderly. Higher total scores are reflective of greater stress and may be predictive of subsequent events, such as becoming physically ill. We discuss some of the relevant issues when we discuss the Holmes and Rahe (1967) scale, which began this whole field (see Chapter 15).

DEATH AND DYING

Death and dying is a central concern for all of us, but it is particularly salient for older persons, if for no other reason than the increase in ill health and the more frequent death of others in the lives of older persons. Much of the focus has been on scales that measure the fear of death (see Chapter 8). Marshall (1982) reviewed 32 instruments having to do with death and dying and divided them into 5 topical categories: measures of the experience of death; measures of awareness of impending death; measures of death anxiety (the largest category); measures of other attitudes toward death; and measures of behavior and planning in response to death. As an example, although not necessarily a typical one, let us look at the death-images scale developed by Back (1971). This author selected 25 metaphors or phrases to describe death, such as "an infinite ocean," "a falling curtain," and "a bursting rocket." Each of these items is printed on a separate card, and the respondent is asked to sort the 25 cards into 5 piles, ranging from most appropriate to least appropriate as images for death. This is done in a two-step procedure, where the respondent first selects the best five images, then the worst five images, and then five "fairly bad" images. The instructions are incomplete, but presumably the respondent goes on to select five images that are "fairly good," and the five that remain are "neutral." Placement of each item in a pile is then numerically translated so that 1 = best, 2 = fairly good, 3 = neutral, 4 = fairly bad, and 5 = worst.

This instrument was used in a larger study of some 502 adults aged 45 to 70. Seven of the items showed a gender difference, with three items liked more by males and four items liked more by females. Five of the items showed a relationship with age, but Marshall (1982) indicates that none of the death factors had a relationship to age, and no reliability or validity data is presented. In a subsequent study, Ross and Pollio (1991) used this set of metaphors as an interview procedure to study the personal meaning of death. Although most of their data is impressionistic and does not directly address the reliability and/or validity of this instrument, the results nevertheless can be seen as supportive in the context of construct validity. The use of metaphors as test items is intriguing and potentially useful, but very few investigators have used this method. Knapp and Garbutt (1958) used it with time imagery, and more recently Domino used it with cancer imagery (G. Domino, Affonso, & Hannah, 1991; G. Domino & Lin, 1991; G. Domino, Fragoso, & Moreno, 1991; G. Domino & Lin, 1993; G. Domino & Pathanapong, 1993; G. Domino & Regmi, 1993).

NEUROPSYCHOLOGICAL ASSESSMENT

Neuropsychological assessment basically involves the assessment of cognitive and behavioral factors that reflect neurological disease. As Kaszniak (1989) states, neuropsychological evaluation is playing an increasingly important role in the assessment of older adults. A wide variety of measures and approaches are used in neuropsychological testing, but at the risk of oversimplification, we can identify the following major categories of tests (as listed in Schmitt & Ranseen, 1989):

1. Brief screening procedures. A number of procedures have been developed that are brief and are used primarily as screening procedures, to be followed by more extensive testing where appropriate. At the same time, these procedures are often used for other purposes, ranging from the assessment of changes over time to differential diagnosis. Many of these procedures are actually brief mental-status exams. Examples of such procedures are the Mini-Mental State Examination (M. D. Folstein, S. E. Folstein, & McHugh, 1975), the Short Portable Mental Status Questionnaire (Pfeiffer, 1975), and the Cognitive Capacity Screening Examination (Jacobs, Bernhard, Delgado, et al., 1977).

2. Mental status exams. These exams are longer than the screening tests mentioned above and typically take closer to 1 hour to administer. Typical of these exams is the Mattis Dementia Rating Scale (Mattis, 1976), which consists of five subtests that evaluate attention, initiation and perseveration, constructional ability, conceptualization, and memory. Other scales representative of instruments in this category are the Alzheimer's Disease Assessment Scale (Rosen, Motts, & Davis, 1984) and the Neurobehavioral Cognitive Status Examination (Kiernan, Mueller, Langston, et al., 1987).

3. Neuropsychological screening batteries. Schmitt and Ranseen (1989) cite three approaches under this heading. The first consists of standard tests or subtests from different batteries. For example, Filskov (1983) used various subtests from the WAIS and the Halstead-Reitan Neuropsychological Test Battery. A second approach is illustrated by the work of Benton and his colleagues (Benton, Hamsher, Varney, et al., 1983; Eslinger, Damasio, Benton, et al., 1985), who developed a battery of tests for the assessment of dementia. A third approach is illustrated by the work of Barrett (Barrett & Gleser, 1987; Barrett, Wheatley, & La Plant, 1983), who developed a "brief" (2 hour) neuropsychological battery modeled on the Halstead-Reitan Neuropsychological Test Battery.

4. Neuropsychological batteries. There are basically two major neuropsychological batteries available: the Halstead-Reitan Neuropsychological Test Battery and the Luria-Nebraska Neuropsychological Battery. Both of these are rather extensive to administer and require a well-trained clinician to interpret the results. Both instruments have been widely used, and there is a substantial body of literature that generally supports their reliability and validity (see Chapter 15).
5. Tests of memory functioning. As Schmitt and Ranseen (1989) state, an adequate memory test should assess both input and output functions that are involved in the registration, storage, and retrieval of information that is to be remembered. Such assessment should cover various spheres, such as visual memory, auditory memory, and spatial memory. Both recall and recognition need to be assessed. The most commonly used measure, despite a number of limitations, is the Wechsler Memory Scale. Other measures in this category are the Benton Visual Retention Test (Benton, 1974), the Randt Memory Test (Randt, Brown, & Osborne, 1980), and the Denman Neuropsychology Memory Scale (Denman, 1984).

6. Measures of functional abilities. These measures cover a wide variety of activities of daily living, such as the ability to use the telephone, personal grooming, dressing, and managing one's financial affairs.

Basically all of the tests listed above rest on the reasonable assumption that organic brain damage results in the deterioration of psychological functioning, and that the measurement of such functioning will reflect the nature and degree of brain impairment. The tasks that are used to assess such deterioration require a skilled examiner and are usually multivariate, i.e., they contain many subtests and thus require extensive time to administer. Two important questions relevant to the validity of these tests are: (1) Do the test scores differentiate brain-damaged older persons from non-brain-damaged older persons? (2) Do the test scores differentiate brain-damaged older persons from those who are functionally disordered, i.e., those who have a disorder such as depression, which presumably serves a function?

Alzheimer's. Dementia is a disorder of cognition. That is, it involves grossly impaired thinking, memory lapses, and faulty reasoning. Thus the term dementia does not refer to a single illness, but to a group of conditions, all of which involve the same basic symptoms, namely a progressive decline in intellectual functions. Two of these conditions account for most of the patients: one is Alzheimer's, where the neurons of the brain deteriorate, and the other is multi-infarct dementia (an infarct is a small stroke, and this condition is due to the effects of many small strokes that damage brain tissues).

Many current efforts are aimed at the diagnosis of Alzheimer's. For example, Volicer, Hurley, Lathi, et al. (1994) presented a scale to measure disease severity in patients with advanced dementia of the Alzheimer type. They argued that currently there are no such instruments, and one is needed to make decisions related to health-care policies as well as planning. They also point out that the first symptoms of Alzheimer's are often cognitive deficits, which if serious enough severely limit the usefulness of cognitive tests. There are a number of measures of activities of daily living, but these authors felt that the measures are more suitable for patients in the early and middle stages of Alzheimer's, rather than the advanced patients they are interested in. Because of these concerns, Volicer, Seltzer, Rheaume, et al. (1987) developed the Bedford Alzheimer Nursing Scale; based on that scale, they developed a seven-item scale to measure severity. The seven items cover dressing, sleeping, speech, eating, mobility, muscles, and eye contact, and each item presents four descriptors from which the rater checks the one that is most appropriate. For example, the eating item contains these four choices: (1) eats independently, (2) requires minimal assistance and/or coaxing, (3) requires moderate assistance and/or coaxing, and (4) completely dependent. Thus the scale is brief and easily completed by nursing personnel. The items are scored on a 1 to 4 basis, so that a total score of 7 indicates no impairment and a score of 28 indicates complete impairment.

For 3 samples with a total of 77 patients, internal consistency alphas were reported to be between .64 and .80. Two raters were involved in this study, and the interrater reliability ranged from .82 to .87. The construct validity of the scale was assessed by comparing the scores to various indices of dependence-independence in activities of daily living, cognitive impairment, and language abilities. In general, the correlations ranged from the low .40s to the mid .60s. One could easily argue that these results represent criterion validity rather than construct validity, as no theoretical rationale is presented.
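Scoring an instrument like the seven-item severity scale described above is simply a matter of mapping each checked descriptor to its point value and summing. A hypothetical sketch, with invented ratings:

    # Scoring a seven-item severity scale where each item is rated by
    # checking one of four descriptors, scored 1 (independent) to 4
    # (completely dependent). The ratings below are invented.

    ITEMS = ["dressing", "sleeping", "speech", "eating",
             "mobility", "muscles", "eye contact"]

    ratings = {"dressing": 2, "sleeping": 1, "speech": 3, "eating": 2,
               "mobility": 3, "muscles": 2, "eye contact": 1}

    total = sum(ratings[item] for item in ITEMS)
    print("severity =", total,
          "(range 7 = no impairment to 28 = complete impairment)")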
Memory assessment. Although concern about failures of memory in everyday life is a topic that intrigued the pioneers of psychology, such as William James and Sigmund Freud, a concentrated effort to study everyday memory did not occur until the 1970s. One approach was the development of self-reported memory questionnaires. If responses to the questionnaires could be shown to correspond to observed behavior, then the questionnaires could provide a valuable tool. In addition, the questionnaires, whether or not they paralleled behavior, could provide valuable insights into a person's beliefs about their memory and memory loss (Herrmann, 1982). Older adults do complain of memory difficulty more than younger adults (Zelinski, Gilewski, & Thompson, 1980). Unfortunately, questionnaires that assess the subjective frequency of everyday memory failures in older adults have typically very poor test-retest reliability and low internal consistency (Gilewski & Zelinski, 1986). In addition, the relationship between subjective reports of memory difficulty and actual performance on verbal memory tests seems to be a function of other factors, such as diagnostic status. For example, depressed older adults tend to complain about memory difficulty but may show no memory performance deficits, while somewhat of the opposite pattern may be true of patients with Alzheimer's disease (Kaszniak, 1989).

Herrmann (1982) reviewed 14 memory questionnaires. Although he concluded that as a group these questionnaires were reliable, a more careful analysis of the data presented might lead to a more pessimistic conclusion. Of the 15 instruments (one has a short version that is listed separately), 7 do not have reliability data reported. For the 8 that do, the reliability coefficients (mostly test-retest) range from .46 to .88, with 6 of the coefficients below .70.

Two major limitations of many memory tests are that, first, they do not reflect the current state of knowledge about memory because they were developed years ago, and, second, they bear little resemblance to the tasks of everyday life (Erickson & Scott, 1977). More relevant assessments are beginning to be available, at least on a research basis. For example, Crook and Larrabee (1988) developed a test battery that is fully computerized and uses laser-disk technology to simulate memory and learning tasks encountered every day, such as dialing telephone numbers and the recall of a person's name.

The Wechsler Memory Scale (WMS). The WMS has been the most frequently used clinical instrument for the assessment of memory. The scale was developed in 1945 and revised in 1987 (Wechsler, 1987), when it was standardized on a nationally representative sample of adults aged 16 through 74. The WMS was intended as a rapid, simple, and practical memory examination. It consists of seven subtests: personal and current information (How old are you? Who is the President of the United States?); orientation (What day of the month is this?); mental control (i.e., sustained attention, such as counting backwards from 20 to 1); logical memory (a brief passage is read and the subject is asked to recall the ideas in the passage); digit span (recall of digits forward and backward); visual reproduction (simple geometric figures are presented for 10 seconds each; the subject is asked to draw them from memory); and associate learning (word pairs are read, such as cat-window, and the subject is asked to recall the second word when the first is presented). However, scores on the seven subtests are combined into a single summary score called the Memory Quotient, so that it is difficult to compare the various aspects of memory performance. Norms were initially available for adults up to age 64, and subsequently extended to include those 80 to 92 years old (Klonoff & Kennedy, 1965, 1966; Meer & Baker, 1965). The WMS was also intended to identify organic problems associated with memory disorders, but subsequent studies showed that the scale did not differentiate among psychotic, neurotic, and organic patients when age and IQ were controlled (J. Cohen, 1950). The WMS was criticized for a number of limitations, including the preponderance of verbal stimuli and inadequate interrater agreement on two of the subtests (Erickson & Scott, 1977; Prigatano, 1978). In 1987, the WMS was revised (Wechsler, 1987) and now consists of 13 subtests, including 3 new nonverbal subtests.
The WMS-R was standardized on a sample of approximately 300 individuals aged 16 to 74, designed to match the general population with respect to race, geographic region, and educational level. The sample was stratified as to age, with about 50 subjects in each of six age groups, and with approximately equal numbers of men and women within each age group. The scale and subtests are composed of a variety of items that include counting backwards, identifying previously shown abstract geometric designs, recalling two stories immediately after they are read by the examiner as well as at the end of the testing, a learning of word pairs for subsequent recall, and repeating the examiner's performance on a series of colored squares that are touched in sequence.

The WMS-R contains not only an expanded number of subtests, but also nonverbal subtests and delayed recall measures (e.g., remembering items from stories read earlier by the examiner). Two of the subtests are used for screening purposes and are kept separate from the rest of the scale. In addition, the revised scale allows the report of separate scores for various components of memory performance. The WMS-R yields two major scores, the General Memory Index and the Attention/Concentration Index. In addition, the General Memory Index can be subdivided into a Verbal Memory Index and a Visual Memory Index. Finally, there is a Delayed Recall Index. Unlike many other test revisions, where what is changed is typically minor and often cosmetic, the WMS became in its revision a vastly superior scale (Robertson-Tchabo & Arenberg, 1989; and for a review of the WMS-R compared with another memory scale, see Zielinski, 1993).

Self-rating scales. Self-rating scales also are used widely in the study and assessment of memory. Gilewski and Zelinski (1986) gave four important reasons for using self-rating scales to assess memory in older adults. First, they pointed out that there is a relationship between memory complaints and memory performance in healthy older individuals. Second, complaints of memory impairment may be early signs of a subsequent dementia, although in advanced stages of Alzheimer's, for example, there may be no relationship between complaint and actual performance, in that the patient may deny memory deficit. Third, complaints of memory deficit may be diagnostically related to depression and may in fact serve as a useful differential diagnostic sign between depression and dementia. Finally, memory complaints may be good indicators of how a person perceives their general cognitive functioning as they become older.

As an example of the self-rating memory scales that are available, consider the Memory Assessment Clinics Self-Rating Scale (MAC-S; Winterling, Crook, Salama, et al., 1986). The MAC-S was developed because the authors felt that available scales had either inadequate normative data, used poor wording of items, or did not consider the multivariate nature of memory. The original MAC-S consisted of 102 items that described specific memory tasks or problems encountered in everyday life – for example, whether the person remembered turning off the lights and appliances before leaving home, or which door they came in when shopping in a large department store or mall. The items were divided into two subscales of equal length: (1) ability, with items indicative of the ability to remember specific types of information, for example, the name of a person just introduced; and (2) frequency of occurrence, with items indicative of how often specific memory problems occurred, for example, going to a store and forgetting what to purchase. On the ability scale, the response format was a Likert-type scale from very poor to very good, while on the frequency-of-occurrence scale the response choices ranged from very often to very rarely.

On the basis of factor analysis, the MAC-S was reduced to 49 items. Crook and Larrabee (1990) administered the scale to 1,106 healthy volunteers, with a mean age of 56 and a range from 18 to 92. The protocols were factor analyzed, and the authors obtained five factors for the Ability scale and five factors for the Frequency of Occurrence scale. These are listed in Table 10–2.

The authors found essentially the same factor structure when the total group was analyzed according to various age subgroups. They also found a lack of association between MAC-S scores and both age and gender. (To see how another memory scale was developed, see Gilewski, Zelinski, & Schaie, 1990.)
Table 10–2. Factors on the Memory Assessment Clinics Self-Rating Scale

Ability Scale (factor – example)
1. Remote personal memory – Holiday or special-occasion memory
2. Numeric recall – Telephone numbers
3. Everyday task-oriented memory – Turn off lights
4. Word recall/semantic memory – Meaning of words
5. Spatial/topographic memory – How to reach a location

Frequency of Occurrence Scale (factor – example)
1. Word and fact recall or semantic memory – Forgetting a word
2. Attention/concentration – Having trouble concentrating
3. Everyday task-oriented memory – Going into a room and forgetting why
4. General forgetfulness – Forgetting an appointment
5. Facial recognition – Failing to recognize others
DEPRESSION

There seems to be agreement that depression represents a major public health problem and that depression can and does occur late in life, with high rates of depression in clients over the age of 65. Thus, depression seems to be the most common functional psychiatric disorder among older persons, although there is some question whether the prevalence of depression increases with age. A number of authors point out that what is called depression in older persons may in fact represent reactions to the economic and social difficulties they encounter, grief over the loss of friends and family, and reactions to physical illness and problems.

The literature suggests that somatic complaints are more prominent in older depressed patients than in younger individuals. However, complaints of fatigue, pain, or lack of energy, which in a younger person may be reflections of depression, may in an older person be realistic evidence of being old, and not necessarily depressed. Part of the complexity of assessing depression in older persons is that many self-report scales of depression contain items that have to do with somatic symptoms, such as sleep disturbances and diminished energy levels. Because older persons do tend to have more physical illnesses, endorsement of these items may not necessarily be reflective of depression (Blazer, Hughes, & George, 1987; Himmelfarb, 1984; Newman, 1989).

A wide variety of procedures are used to assess depression in older persons. These include the depression scales discussed in Chapter 7 and others, particularly the three most common scales: the Beck Depression Inventory (A. T. Beck, Ward, Mendelson, et al., 1961), the Hamilton Rating Scale for Depression (Hamilton, 1960), and the Zung Self-Rating Depression Scale (Zung, 1965). Other approaches include multivariate instruments such as the MMPI, projective tests such as the Gerontological Apperception Test (R. L. Wolk & R. B. Wolk, 1971), and structured interviews, such as the Schedule for Affective Disorders and Schizophrenia (Spitzer & Endicott, 1977).

SUMMARY

Psychological testing of older persons presents a number of challenges from both a psychometric and a clinical point of view. Although this area of testing is relatively young, a number of advances have taken place, but much more needs to be done. The areas that have been presented in this chapter are illustrative of the various issues and challenges faced by both practitioners and researchers alike.

SUGGESTED READINGS

Costa, P. T., & McCrae, R. R. (1984). Concurrent validation after 20 years: The implications of personality stability for its assessment. In N. W. Shock, R. G. Greulich, R. Andres, D. Arenberg, P. T. Costa, E. G. Lakatta, & J. D. Tobin (Eds.), Normal human aging: The Baltimore longitudinal study of aging (NIH Publication No. 84-2450). Washington, D.C.: U.S. Public Health Service.

What happens to personality as a person ages? One answer was given by what is called the Kansas City Studies (see Neugarten & Associates, 1964): As people aged, they became more preoccupied with themselves, more emotionally withdrawn – what eventually became known as disengagement theory. This suggested reading, covering what is known as the Baltimore study, gives a different answer – there is personality stability as one ages.

Gallagher, D. (1986). The Beck Depression Inventory and older adults. Clinical Gerontologist, 5, 149–163.

This article reviews the development and utility of the BDI, with particular emphasis on the use of the BDI with older persons. The author discusses the usage of the BDI, the reliability, validity, factor structure, and other aspects of one of the most popular measures of depression.
Herrmann, D. J. (1982). Know thy memory: The use of questionnaires to assess and study memory. Psychological Bulletin, 92, 434–452.

The author reviews 14 questionnaires designed to assess people's beliefs about their memory performance in natural circumstances. Research findings suggest that responses to these questionnaires are reliable but that they correspond only moderately with a person's memory performance, suggesting that people's beliefs about their memory performance are stable but not very accurate.

Lewinsohn, P. M., Seeley, J. R., Roberts, R. E., & Allen, N. B. (1997). Center for Epidemiologic Studies Depression Scale (CES-D) as a screening instrument for depression among community-residing older adults. Psychology and Aging, 12, 277–287.

A study of more than 1,000 older adults designed to assess the CES-D scale (covered in Chapter 7) as a screening instrument. The article presents data and uses such concepts as sensitivity and specificity, which we discussed. A bit advanced in its use of statistical analyses, but worth reading.

Libman, E., Creti, L., Amsel, R., Brender, W., & Fichten, C. S. (1997). What do older good and poor sleepers do during periods of nocturnal wakefulness? The Sleep Behaviors Scale: 60+. Psychology and Aging, 12, 170–182.

An instructive example of the development of a scale within a clinical context, for use with older persons.

DISCUSSION QUESTIONS

1. You have been assigned to test some elderly people living in a nursing home, while your classmate is testing individuals of the same age living in a retirement community. How might the two experiences differ in terms of testing?
2. Consider the concept of "psychological well-being." What might be the components of such a concept?
3. How might you validate the MSQFOP?
4. How would you develop a memory scale for use with the elderly?
5. What do you consider to be the three major points of this chapter?
11 Testing in a Cross-Cultural Context
AIM What are the problems associated with using psychological tests with minority individuals and those of another culture? If, for example, we wish to administer the WISC-R to a black child, or we translate the test into French for use with French children, will the test still be valid? Basically, this is the issue we look at in this chapter.

INTRODUCTION

In this chapter, we look at cross-cultural testing, that is, at some of the ways in which culture and testing can interact. We use the term "culture" in two different ways: (1) to delineate people living in different countries, for example, the United States vs. the People's Republic of China; and (2) to refer to minority groups within a particular country, for example, blacks and Hispanics living in the United States. There are of course many ways of defining culture. For our purpose, we can define culture as a set of shared values and behaviors that include beliefs, customs, morals, laws, etc., that are acquired by a person, shared in common with other members who are typically in close proximity, but different from those held by others who often live in a different geographical setting (D. W. Sue & D. Sue, 1990).

MEASUREMENT BIAS

The issue of test or measurement bias is a central one for all who are concerned with developing and using tests. There is a substantial body of literature on the topic, with some rather complex statistical issues, and even entire books devoted to the topic (e.g., Berk, 1982; Osterlind, 1983). Most concerns about test bias are related to tests of intelligence and, to a lesser degree, to tests of aptitude, ability, and achievement; we can use the broader label of cognitive-ability tests to cover these various aspects.

During the 1960s, the use of standardized tests with ethnic minorities became a major issue. Critics claimed that standardized tests: (1) were loaded with items based on white middle-class values and experiences; (2) penalized children who had linguistic styles different from that of the majority culture; (3) assessed cognitive styles often substantially different from those found in low-income families; (4) fostered a dual educational system by excluding minority children from regular educational programs; (5) were of no use in formulating instructional programs; and (6) were culturally biased and discriminated unfairly against racial and ethnic minorities (P. Henry, Bryson, & C. A. Henry, 1990).

Blacks vs. whites. Much of the controversy on test bias revolves around the performance of blacks on cognitive-ability tests and, to a lesser extent, around the performance of Hispanics, primarily Mexican-Americans. As the Hispanic population in the United States continues to grow, the concern about possible test bias has become more salient with this population, particularly with the issue of bilingualism (Olmedo, 1981). Asian minority groups seem to do well on cognitive-abilities tests and on academic achievement, so the issue of test bias is not brought up. Indeed, differences in average cognitive performance between white and black students in the United States do exist, and often they approach a full standard deviation. This means that a level of performance that is achieved by about 84% of white students is achieved by only 50% of their black peers. Most psychologists would argue that such results do not reflect test bias, but rather the cumulative effects of societal bias.
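The 84%-versus-50% figures follow directly from the normal curve. A short sketch of the arithmetic, assuming normal distributions and a one-standard-deviation difference between the group means:

    # If two groups differ by 1 SD, the score that about 84% of the
    # higher-scoring group reaches or exceeds sits exactly at the mean
    # of the lower-scoring group, so only 50% of that group reaches it.

    from math import erf, sqrt

    def proportion_above(z):
        # Proportion of a normal distribution scoring above z.
        return 1 - 0.5 * (1 + erf(z / sqrt(2)))

    cutoff_z_group1 = -1.0  # the cutoff is 1 SD below group 1's mean
    print("group 1 above cutoff: {:.0%}".format(proportion_above(cutoff_z_group1)))

    cutoff_z_group2 = 0.0   # the same cutoff falls at group 2's mean
    print("group 2 above cutoff: {:.0%}".format(proportion_above(cutoff_z_group2)))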
We can talk about bias at three different stages: (1) before the test, referring to those societal and environmental aspects that result in discrimination, lower self-esteem, poorer nutrition, fewer opportunities, etc.; (2) in the test itself as a measuring instrument; and (3) in the decisions that are made on the basis of test scores. Most psychologists would argue that it is far more likely for bias to occur in the first and last stages than in the second stage.

Messick and Anderson (1970) indicate that there are three possible sources for the typical finding that minority children do less well than majority children on tests of cognitive abilities:

1. The test may measure different things for different groups. To assess this, the reliability and validity of the test needs to be studied separately for each group, and the results need to be comparable to conclude that the test is not biased. Construct validity needs to be assessed by looking at the pattern of correlations between the test and other measures; comparability of results across different groups would be evidence for lack of test bias.

2. The test may involve irrelevant difficulty – for example, an answer sheet that is confusing or difficult to use, testing conditions that may increase anxiety more for one group than for another, or items that differentially favor one group over the other. In fact, such issues are of concern to test constructors, and most well-standardized tests cannot be faulted on such aspects.

3. The test may accurately reflect ability or achievement levels. Lower scores on the part of a minority group do not necessarily reflect bias in measurement, but may reflect the effects of poverty, prejudice, and inequality of educational opportunities.

The cultural test-bias hypothesis. This hypothesis contends that group differences on mental tests are due to artifacts of the tests themselves and do not reflect real differences between groups that differ on such demographic variables as ethnicity, race, or socioeconomic status. There is, in fact, almost no evidence to support such a hypothesis with tests that have been carefully designed, such as the Stanford-Binet or the Wechsler tests (A. R. Jensen, 1980).

Eliminate tests. In the 1970s, black psychologists demanded an immediate moratorium on all testing of black persons (R. L. Williams, 1970), with the assumption that most, if not all, tests were intrinsically biased against minorities (Messick & Anderson, 1970). The typical response to this demand was to argue that tests per se were not biased, but that tests were misused. Some argued that both questions – whether a test is valid or not, and whether a test should be used in a specific context – needed to be addressed. The first question is a scientific one: the answer can be found in the psychometric properties of a test. The second question is an ethical one, whose answer can be found in terms of human values. Messick and Anderson (1970) argued that not using tests would not eliminate the need to make decisions, and that alternate decision-making mechanisms such as interviews and/or observations would be more costly, more biased, and less valid. Others have argued rather convincingly that the social consequences of not using tests are far more harmful than the consequences of using tests to make educational decisions (e.g., Ebel, 1963). There are also some arguments for using tests. Cognitive tests are of value in documenting patterns of strengths and weaknesses in all children and are useful in documenting change and progress. Tests represent an objective standard free of examiner prejudice. Tests are equally good predictors of future performance for white and for minority children. Tests can be useful for securing and evaluating special services in the schools, such as Head Start programs. Without appropriate evaluations, children may not receive the services they are entitled to (Wodrich & Kush, 1990). Another general criticism that is made of intelligence tests is that they ignore the multicultural aspects of American society, that they treat individuals as if they were culturally homogeneous (Samuda, 1975).
Somewhat more specific arguments against the testing of minority children are also expressed, and most of these can be subsumed under four categories:

1. Cognitive tests are biased because they have been developed to mirror middle-class, white values and experiences. The counterargument is that in fact there is no evidence that cognitive tests are biased against minority members.

2. Minorities are not represented in the norms, and therefore score interpretation is inappropriate. Although this argument may apply to some tests, it does not apply to the major cognitive measures such as the Stanford-Binet or the WISC, whose norms are representative of the general population according to census parameters.

3. Minority students do not have the appropriate test-taking skills, sophistication, or orientation (e.g., awareness of the need to answer rapidly on a timed test). Poor test-taking skills are of course not the sole province of minority children; nonminority children can be just as deficient. The issue here is one of competence in administering tests. The examiner should be aware of and recognize individual factors that may interfere with or limit the child's performance, regardless of the ethnicity of the child. Test results for a particular child may indeed be limited and even invalid, although the test itself may be valid from a psychometric point of view.

4. Most examiners are white and use standard English, with detrimental effects on minority children's scores. The literature indicates that the effect of examiner race on cognitive test scores is negligible (Sattler & Gwynne, 1982). Fewer studies have been done on the effects of using standard English vs. Black dialect, but the results here also suggest negligible differences (L. C. Quay, 1974).

Extraneous variables. It is sometimes argued that certain types of test items are biased against particular groups. In particular, there is concern that what are nonessential characteristics of particular test items may result in poorer performance for minority children. For example, if we wish to measure arithmetic skills and the problems are presented as vignettes (e.g., John has six oranges . . .), the use of words that are less familiar to one group than another may result in a biased item. If it were a matter of only vocabulary, then possible solutions might be relatively easy. However, the matter becomes more complex because aspects such as test anxiety or differences in motivation may interact with aspects of the test items. Such interactions can in fact be studied experimentally, and recently approaches based on item response theory (see Chapter 2) have been used (e.g., Linn & Harnisch, 1981). Ultimately, one must ask why a particular test item is "biased." If it is a matter of vocabulary, for example, might that not reflect an "instructional" or a learning bias rather than a test-item bias? In many ways, test bias is similar to the concept of test validity, in that there is no one index or procedure that in and of itself allows us to say, "this test is biased." Rather, bias, like validity, is arrived at through the gathering of substantial evidence and an objective analysis of that evidence (Sandoval, 1979).

Limited English proficiency. English is not the first language for a substantial number of students in the United States, and they have limited English proficiency. Their number seems to be growing and, because of increased use of standardized tests in school systems, there is great concern that test results for these students may either be less valid or misused.

Lam (1993) indicates that test developers, particularly of standardized achievement tests, make five assumptions: (1) test takers have no linguistic barriers that might interfere with their performance on the test, i.e., they can follow instructions, understand the test items, and have adequate time to complete the test; (2) the test content is suitable and of appropriate difficulty level for the test taker; (3) test takers have the required test sophistication for taking standardized achievement tests; (4) test takers are properly motivated to do well on the test; and (5) test takers do not have strong negative psychological reactions (such as anxiety or feeling stressed) to testing.
Testing discussed in Chapter 1. Part of the solution consists of strategies that reduce the probability that any of the five assumptions are violated. These include translating tests into the child's native language, developing ethnic-specific norms, developing tests that accommodate the cultural differences of various people, extending time limits, using items that are relevant to the minority culture, and so on.

Matluck and Mace (1973) presented a number of suggestions regarding tests to be used with Mexican-American children. With regard to format, such tests should assess separately the child's receptive ability (e.g., listening comprehension) vs. productive ability. The test should use appropriate stimuli – for example, for younger children pictorial-visual stimuli are most likely appropriate, but verbal-auditory stimuli are not. Similarly, the number of items and the administration time should be appropriate – for example, for most children 15 to 25 minutes is considered appropriate. With regard to content, the items should be simple in language and not require linguistic skills that are beyond the child's age. Items should not have language or cultural bias. Concerning test materials, the authors point out that sometimes "impressionistic" or sketchy line drawings are used as stimuli, and a child may not have the appropriate experiential background to deal with such materials; actual objects or photographs might be better. Finally, with respect to the test examiner, the authors point to a need to be sensitive as to whether the examiner's gender, degree of experience, physical appearance as to ethnicity, and so on might influence the test performance of the child.

External vs. internal criteria. A number of authors define test bias in terms of validity and distinguish between validity that focuses on external criteria and validity that focuses on internal criteria. Thus A. R. Jensen (1974; 1976), for example, identified two general strategies for determining bias in tests, one based on external criteria and the other on internal criteria. External criteria involve predictive validity, and assessing the test's predictive validity in minority and majority samples. Internal criteria involve content and construct validity, an analysis of the test in terms of the item content and the overall theoretical rationale. An example of an external criterion study is that by Reschly and Sabers (1979), while an example of an internal criteria study is that of Sandoval (1979). Incidentally, both of these studies found the WISC-R not to be biased against minority children, including blacks and Mexican-Americans.

A similar discussion is presented by Clarizio (1982), who also defined test bias in terms of external and internal criteria. From an external or predictive validity point of view, a test is unbiased if the prediction of criterion performance is of equal accuracy in the two samples, i.e., equivalent regression equations or standard errors of estimate. From an internal or construct validity point of view, a test is unbiased if it behaves the same way for different groups. Evidence of such "unbias" might focus on test homogeneity, rank ordering of item difficulty, loadings on "g," and the relative frequencies in choice of error distractors, for the two groups being compared.

In general, investigations of internal measures of validity have typically found no evidence for test bias, whether in terms of differential reliability, rank order of item difficulty, factor structure, or other psychometric concerns. Similarly, investigations of external measures of validity have typically found no evidence for such bias. Regression equations to predict a particular outcome show no differential validity, and appear to be relatively valid for different ethnic and/or socioeconomic groups. C. R. Reynolds (1982) concludes that psychological tests, especially aptitude tests, function in essentially the same manner across race and gender. Differential validity does not seem to exist.

Many "external" studies of test bias use achievement test scores as the criterion – thus, for example, WISC-R IQs are correlated with scores on an achievement-test battery; the results of such test batteries are often routinely available in students' folders. Critics question the use of such achievement test scores as the criterion and argue instead that "actual behavior" should be the criterion. The problem seems to be that no one is willing to define concretely what such actual behavior might be, other than school grades.

Eliminating test bias in test development. A number of steps can be taken to attempt to eliminate bias as a test is being developed. First, a sensitive and knowledgeable test writer can eliminate
obviously biased items. Item statistics can be easily collected, and those items that are related to race, gender, or other irrelevant variables can be eliminated. Sometimes, matched samples are used for this purpose – that is, the pool of items is administered to a black sample and a white sample that have been matched on ability. The difficulty level of each item is computed for the two samples separately, and items that show a difference in difficulty rates are eliminated. It should be obvious, however, that when we match the samples we no longer have representative samples; also, an item that has different difficulty rates in different groups is not necessarily a biased item. We can compute the correlation of an item with the criterion and determine whether the item predicts the criterion equally well in the two samples (although item-criterion correlations are typically quite low).

Test bias. Specifically, what are the problems that are perceived to be present in using tests with minorities? C. R. Reynolds (1982) lists six:

1. Inappropriate test content; test items are used that reflect primarily white middle-class experiences to which minority children have had little or no exposure.

2. Inappropriate standardization samples; the samples are either all white or ethnic minorities are underrepresented.

3. Examiner and language bias; lower test scores for minority children reflect their intimidation with a white examiner who speaks standard English.

4. Inequitable social consequences; because of bias in tests, minority group members, who are already at a disadvantage in the marketplace, are subject to further discrimination.

5. Tests measure different constructs in minority children than they do in majority children.

6. Tests have differential predictive validity; they may be valid for white middle-class children, but they are not valid for minority children.

APA Committee study and report. In 1968, the American Psychological Association Board of Scientific Affairs appointed a committee to study the issue of test bias. The Committee prepared a report (Cleary, Humphreys, Kendrick, et al., 1975) and offered a definition of test bias that focused on predictive validity, although both content and construct validity were also considered important. The Committee stated that "a test is considered fair for a particular use if the inference drawn from the test score is made with the smallest feasible random error and if there is no constant error in the inference as a function of membership in a particular group" (Cleary, Humphreys, Kendrick, et al., 1975, p. 25).

This definition is based on earlier work by Cleary (1968), whose definition of test bias in terms of errors of prediction has become almost universally accepted. Given a particular test and a particular criterion, we can compute the regression line by which we use test scores to predict that criterion. If the criterion score that is predicted from the common (i.e., for both blacks and whites) regression line is consistently too high or too low for members of a subgroup, then the test is said to be biased.

A broader view. Others have taken a broader view of test bias. One way to define test bias is to consider those aspects that prevent a test from being valid when used with a particular individual in a particular instance (Bradley & Caldwell, 1974). Three sources of potential bias can then be identified: (1) bias due to the test itself (e.g., the test is not valid, or the test is unduly influenced by social desirability); (2) bias due to the client (e.g., the client does not pay attention); and (3) bias due to the situation (e.g., interfering noise from an airplane while the test is administered). Note that in this approach bias becomes lack of validity; to the extent that such lack is related to a minority group, then we have test bias. In fact, most experts would include only the first bias under "test bias" and would place categories 2 and 3 under some other label such as error, lack of experimental control, individual differences, etc. To the extent that they affect testing, they need to be controlled, eliminated, or accounted for.

A narrower view. Are there differences in mean performance on a particular test among groups that differ in ethnicity? If there are, this is taken as evidence of test bias. This particular point of view is based on the implicit notion that all people are equal on the particular variable being measured.
But the reality is quite different. We may well accept the notion of "all people are created equal" in terms of human dignity and respect, but all of the scientific evidence points to the fact that people are quite different from each other in all sorts of ways.

Some believe that a test is culturally or racially biased when a child's performance on the test is compared against a culturally or racially different reference group that has a higher mean score. Thus, a test given to Mexican-American children is considered biased if the children's performance is compared to Anglo norms. This is the view that is promulgated by Mercer (1976), who argues for the use of "pluralistic" norms (see the discussion that follows on the SOMPA). Many have argued that such reasoning is fallacious (e.g., Clarizio, 1982).

Some critics assume that if a test is standardized on a particular ethnic group, such as whites, it therefore must be biased if used with another ethnic group. The answer is obvious: the reliability and validity of a test that is used with different groups need to be investigated empirically.

The psychometric perspective. Psychometrically, a test is biased if it results in systematic error related to one group but not to another. Specifically, there are two situations that can occur. The first is differential validity, or slope bias. Here we have a situation where the relationship between test scores and criterion scores (for example, SAT scores and predicted GPA) is substantially greater in one ethnic group than in another. Because the correlation, or regression line, is represented statistically by the slope of graphed data, this is called slope bias. In fact, the literature suggests that slope bias is the result of poor experimental procedure, differences in sizes of samples from majority and minority populations, or chance findings (e.g., J. E. Hunter, Schmidt, & R. F. Hunter, 1979).

To examine slope bias we look at group differences by using a regression equation to predict from test scores (e.g., IQ) to criterion scores (e.g., GPA). Two questions are typically asked here: (1) Can the same regression equation be used for the two different groups? and (2) Does the regression equation overpredict or underpredict for either group? Underprediction means that the estimated criterion value is lower than the actual value, while overprediction means that the estimated criterion value is higher than the actual value. If under- or overprediction occurs in a systematic way, there is bias present.

In the second situation, the majority group obtains a higher mean score on the test than the minority group; the bias comes in that both groups do equally well on the criterion. Thus, for example, if we could show that on a college entrance exam whites obtained a mean score of 600 and blacks obtained a mean score of 400, and both groups did equally well on academic achievement as measured by grades, for example, then we would have what is called intercept bias (the term again referring to the regression line in a bivariate graph). In this case, the predictive validity coefficients for each sample would be approximately equal, so that test scores would be equally predictive of criterion performance. However, if a college admissions committee were to use a particular cutoff score to admit or reject applicants, a greater proportion of minority applicants would be rejected. In fact, there is no evidence that supports intercept bias, and some studies have shown that there is a slight to moderate bias in favor of the minority group (e.g., Duran, 1983; J. E. Hunter, Schmidt, & Rauschenberger, 1977).

Intercept bias is often assessed through the analysis of item difficulty. Obviously, if there are mean differences between samples, there will be differences in item difficulty; in fact, we can think of the mean as reflecting average difficulty. To determine item bias we look for items that do not follow the expected pattern. Thus, for a minority sample, one item may be substantially more difficult than similar items; such an item needs to be inspected to determine what is causing these results, and it should possibly be eliminated from the test.

In fact, when these procedures are used in an objective, scientific manner, the finding is that cognitive tests are generally not biased against minorities.
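These two questions lend themselves to a simple computational illustration. The Python sketch below, using entirely hypothetical test and criterion scores, fits a separate least-squares line for each group (question 1) and then checks whether a common regression line systematically over- or underpredicts for either group (question 2); all data and variable names are invented for illustration.

```python
import numpy as np

# Hypothetical test scores (e.g., SAT) and criterion scores (e.g., GPA)
# for a majority and a minority sample; all values are invented.
majority_test = np.array([520.0, 580, 610, 640, 700, 750])
majority_gpa = np.array([2.6, 2.9, 3.0, 3.2, 3.5, 3.7])
minority_test = np.array([450.0, 500, 540, 590, 630, 680])
minority_gpa = np.array([2.4, 2.7, 2.9, 3.1, 3.3, 3.5])

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Question 1: are the two within-group regression equations similar?
print("majority slope/intercept:", fit_line(majority_test, majority_gpa))
print("minority slope/intercept:", fit_line(minority_test, minority_gpa))

# Question 2: does a common (pooled) regression line systematically
# over- or underpredict for either group?  Fit one line to all the
# data, then inspect the mean residual (actual minus predicted) in
# each group; a mean residual near zero suggests no systematic
# over- or underprediction under the common equation.
slope, intercept = np.polyfit(np.concatenate([majority_test, minority_test]),
                              np.concatenate([majority_gpa, minority_gpa]), 1)
for label, x, y in [("majority", majority_test, majority_gpa),
                    ("minority", minority_test, minority_gpa)]:
    residuals = y - (intercept + slope * x)
    print(label, "mean residual:", round(residuals.mean(), 3))
```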
Bias and decision theory. Tests are used by educational institutions and by some businesses to screen applicants for admission or hiring. Not entirely for unselfish reasons, most of these institutions would like to make decisions that are
culturally and/or ethnically fair. Petersen and Novick (1976) discuss this process and the various models for culture-fair selection. They point out that the selection situation is basically the same; we have a group of applicants about whom decisions are to be made on the basis of some information. The information is processed by some strategy, some set of rules, leading to a decision to either admit or not admit, hire or not hire. There are consequences that result from that decision, not only in terms of the individual, who may find himself with or without a job, but also in terms of the outcome – the individual's performance after the assignment. Several different models assess whether the selection strategy is indeed culture fair. The four main models are:

1. The regression model. This is probably the most popular model, and it is usually associated with Cleary (1968). This model defines fairness as identical regression lines for each sample (e.g., blacks and whites) and therefore the use of a common regression equation. In effect, what this means is that we disregard race. For example, if we believe that SAT scores are the most objective and valid predictor of grades, then we will admit those applicants who score highest on the SAT regardless of their ethnic background.

2. The constant ratio model. Thorndike (1971) pointed out that it is not enough to consider the regression line; we need to also consider the proportion of applicants admitted from each sample. For example, we might find that the SAT predicts GPA equally well for blacks and for whites, but that blacks as a group score lower on the SAT than whites. If we select the top-scoring applicants on the SAT, we will select proportionally more whites than blacks. We need to take into account such an outcome, essentially by using different decision rules for each sample, based on either statistical criteria or "logical" criteria (e.g., if 25% of applicants are black, then 25% of admissions should also be black). Thorndike (1971) argued that a test is fair if it admits or selects the same proportion of minority applicants who would be selected on the criterion itself. For example, if we know that 40% of minority applicants to our college equal or exceed the average majority-group member in GPA, then if we select 50% of the majority applicants on the basis of their SAT scores, we need to select 40% of minority applicants. In terms of decision theory (see Chapter 3), this model looks at the number of false negatives and false positives as related to the number of true negatives and true positives. This approach, which seems as reasonable as the definition given above, leads to rather different conclusions (Schmidt & Hunter, 1974).

3. The conditional probability model. Cole (1973) argued that all applicants who, if selected, are capable of being successful on the criterion should be guaranteed an equal opportunity to be selected, regardless of ethnic membership. The focus is on the criterion: if a person can achieve a satisfactory criterion score (e.g., GPA of C or above), then that person should have the same probability of being selected, regardless of group membership. In terms of decision theory, this model looks at the number of true positives in relation to the number of true positives plus false negatives (i.e., sensitivity).

4. The equal probability model. This model argues that all applicants who are selected should be guaranteed an equal chance of being successful, regardless of group membership. This model looks at the true positives compared with the true positives and false positives (i.e., predictive value).

These approaches are highly statistical and involve sophisticated analyses that go beyond the scope of this book.
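Although the full treatments are indeed beyond the scope of this book, the quantities on which models 2 through 4 turn are just ratios of the four decision outcomes discussed in Chapter 3. A minimal Python sketch, with invented outcome counts for two hypothetical applicant groups:

```python
# Decision outcomes for each group (invented counts, for illustration):
# TP = selected and successful on the criterion, FP = selected but
# unsuccessful, FN = rejected but capable of success, TN = rejected
# and not capable of success.
groups = {
    "majority": {"TP": 40, "FP": 10, "FN": 20, "TN": 30},
    "minority": {"TP": 15, "FP": 5, "FN": 15, "TN": 25},
}

for name, c in groups.items():
    selected = c["TP"] + c["FP"]
    total = selected + c["FN"] + c["TN"]
    # Constant ratio model: compares the proportion selected in each
    # group with the proportion successful on the criterion.
    selection_ratio = selected / total
    criterion_ratio = (c["TP"] + c["FN"]) / total
    # Conditional probability model: of those capable of succeeding,
    # what proportion is selected?  TP / (TP + FN), i.e., sensitivity.
    sensitivity = c["TP"] / (c["TP"] + c["FN"])
    # Equal probability model: of those selected, what proportion
    # succeeds?  TP / (TP + FP), i.e., predictive value.
    predictive_value = c["TP"] / (c["TP"] + c["FP"])
    print(f"{name}: selected {selection_ratio:.2f}, "
          f"successful on criterion {criterion_ratio:.2f}, "
          f"sensitivity {sensitivity:.2f}, "
          f"predictive value {predictive_value:.2f}")

# Fairness under the conditional probability model requires roughly
# equal sensitivity across groups; under the equal probability model,
# roughly equal predictive value.
```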
Logical solutions. Not all theoretical models regarding test bias are psychometric in nature. Darlington (1976), for example, suggested that selection strategies be based upon "rational" rules rather than "mechanical" ones. He felt, for example, that a college should determine the number of minority applicants to admit in much the same way that it determines the number of athletes to admit. In other words, the strategies are to be determined by the policymakers rather than by "psychometric technicians."

Nature vs. nurture. Central to the issue of bias is the question of nature vs. nurture. For example, is intelligence determined by our genes (i.e., nature), or is it heavily influenced by educational experiences, family setting, and other environmental aspects (i.e., nurture)? These are the two major perspectives that are used to explain racial differences on cognitive measures, although other perspectives have been presented
(e.g., J. E. Helms, 1992). A basic point is that the question, "Is it nature or nurture?" is not a good question to ask. Any complex behavior such as that reflected in our label of intelligence is the result of myriad influences, including both hereditary and environmental aspects that interact with each other in complex ways. To go over the various historical issues and findings associated with this controversy would take us far afield and not serve our purposes; the story has been told countless times and the reader can consult such sources as Samuda (1975) for an overview.

Some findings. A number of studies have shown that the mean score on cognitive tests for blacks in the United States is significantly and consistently lower than that of whites. These results are fairly factual and, in general, are not disputed. What is disputed is the interpretation. On one side, individuals attempt to explain such findings as reflecting "nature," that is, hereditary and/or genetic differences. On the other side are those who attribute the findings to "nurture," to aspects such as differences in nutrition, school environments, role models, etc. In the middle are the interactionists, who believe that such mean differences are the reflection of both nature and nurture as they interact in ways that we are barely beginning to understand. Off to one side are those who believe that such findings are an artifact, that they reflect inadequacies and biases inherent in our instruments. Still others believe that there is a conspiracy afoot, with testing serving a gate-keeping function, a way of keeping minorities "in their place."

Studies that support one particular point of view are both easy and hard to find – easy because research results can sometimes be interpreted in accord with one's preferred theoretical stance, and hard because studies that control for possible confounding aspects are difficult to carry out. The Coleman Report (Coleman et al., 1966) was a survey of schools attended by blacks and by whites. The authors showed that ethnic differences in tests of intelligence and in tests of academic achievement were not related to differences in school curricula, physical facilities, or teacher characteristics. The authors concluded that when socioeconomic background is controlled for, there is little difference in test performance that can be attributed to the schools themselves. Thus the differences that do exist in mean test scores are not reflective of ethnicity per se, but of socioeconomic status. Poor whites and poor blacks do less well than their more advantaged peers. Yet the fact remains that carefully done reviews of the literature point to heredity as a major source of variation in intelligence scores (A. R. Jensen, 1969), and sober-minded reflections suggest that most criticisms of tests as culturally biased are the result of prejudice and preconceptions (Ebel, 1963).

A historical note. Unfortunately, much of the literature on this topic seems to be severely flawed. Early studies such as those of Goddard (1913; 1917), who tested immigrants to the United States and found 80% of them to be "feeble-minded," did not take into account such obvious aspects as lack of knowledge of English. More current studies make sweeping generalizations and often reflect political vehemence rather than scholarly deliberation (R. L. Williams, 1971).

Most test authors have been aware of the limitations and potential misuses of their instruments; Binet himself cautioned professionals on the limitations of his newly developed measure of intelligence. Unfortunately, many of the pioneers in the field of intelligence testing in the United States, notably Terman and Goddard (see Chapter 19), were not as "scientific" as they ought to have been. Goddard, for example, administered the translated Binet-Simon to arriving European immigrants and concluded that 83% of Jews, 80% of Hungarians, 79% of Italians, and 87% of Russians were feeble-minded. In opposition to such misguided "findings," a number of early studies can be cited where the authors were sensitive to the limitations of their instruments. For example, Yerkes and Foster (1923) argued that the interpretation of a person's IQ score should be made in the context of the person's socioeconomic background, as well as educational and familial history.

Some investigators concluded that nonverbal measures of intelligence were the appropriate instruments to use with minority children (e.g., Garth, Eson, & Morton, 1936), while others pointed to differences in home environments, educational deficits, and other differential aspects between Anglo and minority children (e.g., G. Sanchez, 1932; 1934). Other investigators
administered the same test in an English format and a Spanish format to Spanish-speaking children, and found that the children scored significantly higher on the Spanish version than on the English version (e.g., Mahakian, 1939; A. J. Mitchell, 1937), although later studies that controlled for degree of bilingualism found just the opposite results (e.g., Keston & Jimenez, 1954). Finally, other investigators began to look at the administrative aspects of tests as possible sources of bias. R. R. Knapp (1960), for example, administered the Cattell Culture Fair Intelligence Test to Mexican boys with no time limit, and found that they scored higher than an Anglo sample where the test was given with the standard time limit.

Diagnostic discrimination. It is sometimes argued that test bias is not so much a matter of the test being biased; the bias occurs in the use of test results to label minority children and to place a disproportionate number of these in special-education programs, that is, to label these children as mentally retarded. In fact, the empirical evidence argues just the opposite. A number of studies show that black and low-socioeconomic-class children are less likely to be recommended for special-education-class placement than their white or higher-socioeconomic-class peers (C. R. Reynolds, 1982).

Language and examiner bias. Some studies have shown significant increases in mean scores of black children when a test has been administered using standard vs. nonstandard English (e.g., Hardy, Welcher, Mellits, et al., 1976), but others have not (e.g., L. C. Quay, 1974). C. R. Reynolds (1982) points out that such studies do not include experimental and control groups of white children. Jencks (1972) concluded that there was no evidence to support the hypothesis that black children are more disadvantaged on verbal tests, where language is important, than on nonverbal tests, where language is at a minimum. Others (e.g., Oakland & Matuszek, 1977) have concluded that having a white examiner does not alter the validity of test results for minority children.

Special-education children. One particular concern might be with children who are tested because of possible retardation and/or educational difficulties. Perhaps tests are fair when used with average children but biased when used with lower-functioning children. In a typical study, Poteat, Wuensch, and Gregg (1988) reviewed 83 black and 85 white students referred for special-education evaluations. These children ranged in age from 6 to 16, with a median age of 10, and represented some 20 different schools. Of these students, 41% were eventually placed in programs for the learning disabled, 10% were identified as educable mentally handicapped, and 11% as needing other forms of special education.

For the black students, the mean WISC-R IQ was 79.5, while for the white students it was 94.1. Similarly, the mean GPA for black students was 2.73 and for white students 3.09. A significant mean difference was also obtained on average scores on the California Achievement Test. WISC-R Full Scale IQs were significantly correlated with GPA for both black students (r = .32) and white students (r = .42), with differences in the regression line not significant. A variety of other statistical analyses again indicated no significant differences between black and white students in the differential validity of the WISC-R.

Determining item bias. One approach is to determine item difficulty separately for the majority group and for the minority group. In fact, item bias is now called "differential item functioning" to reflect this perspective. If any item seems to be particularly difficult for one group, relative to other items on the test, then the item is considered potentially biased. When in fact such items are identified during test construction, there seem to be two possible explanations: (1) the items are poorly written and of low reliability, and hence ought to be removed; (2) the items seem well written, with adequate reliability, and do not share any common characteristics that might provide a reasonable explanation. When such items are eliminated from a test, the results are not particularly different from those that are obtained with the items retained. What seems to occur is that the test becomes slightly more difficult for everyone, because the eliminated items typically have moderate to low difficulty (C. R. Reynolds, 1982).
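The basic comparison is easy to illustrate. The Python sketch below computes item difficulty (the proportion passing each item) separately for two groups of hypothetical scored responses, and flags any item whose group difference departs markedly from the overall pattern. Operational differential-item-functioning analyses use more refined statistics (e.g., those based on item response theory), but the underlying logic is the same; both the data and the flagging threshold here are invented.

```python
import numpy as np

# Scored responses: rows are examinees, columns are items,
# 1 = correct, 0 = incorrect.  Both matrices are invented.
majority = np.array([[1, 1, 0, 1],
                     [1, 0, 1, 1],
                     [1, 1, 1, 0],
                     [0, 1, 1, 1],
                     [1, 1, 0, 1]])
minority = np.array([[1, 0, 0, 1],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [1, 0, 1, 1]])

# Item difficulty, conventionally reported as the proportion passing.
p_majority = majority.mean(axis=0)
p_minority = minority.mean(axis=0)

# An overall mean difference shifts all items; a potentially biased
# item is one whose difference departs from that overall pattern.
diff = p_majority - p_minority
flagged = np.abs(diff - diff.mean()) > 0.15  # arbitrary threshold

for i in range(len(diff)):
    note = "  <- inspect" if flagged[i] else ""
    print(f"item {i + 1}: majority p = {p_majority[i]:.2f}, "
          f"minority p = {p_minority[i]:.2f}{note}")
```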
Another approach is to have expert minority group members review proposed test items
and to eliminate any items the experts judge to be biased, whether in content, wording, scoring, or some other aspect. Unfortunately, the research literature indicates that such judgments show little relationship to actual empirical findings. Sandoval and Mille (1979, cited by C. R. Reynolds, 1982), for example, asked a sample of 100 judges from Mexican-American, Anglo, and black backgrounds to judge which of 30 WISC-R items would be of greater difficulty for specific ethnic groups. The authors concluded that judges are not able to identify items that are more difficult for a minority child than for an Anglo child; both minority and nonminority experts were equally incorrect in their subjective judgments vis-à-vis the empirical data.

When there is such cultural bias in test items, it is usually related to the specificity of item content. For example, an item that requires the identification of the capital of Pakistan may be biased in favor of individuals who live in that part of the world. Such bias can best be established by objective analysis based on item statistics, rather than subjective judgment (Clarizio, 1982).

Intent of the test. One question often neglected by critics of tests concerns the intent of a test. In Chapter 9, for example, we discussed the Boehm Test of Basic Concepts. This test has been criticized as being culturally biased because the language of the items is "unfamiliar" to black children. For example, the item "mark the toy that is behind the sofa" should be changed to "mark the toy that is in back of the sofa." Such arguments neglect the intent of the test. The Boehm assesses a child's knowledge of specific language concepts that are used by teachers in the classroom. If the concept were "in back of," then the change would be appropriate; but the concept that is assessed is "behind" (C. R. Reynolds, 1982).

Separate or racial norming. One solution that is sometimes proposed to perceived test bias is that of separate norms, also called racial or subgroup norming (D. C. Brown, 1994). This involves interpreting a person's test score in comparison to norms based on an ethnically relevant sample. For example, the SAT scores of a black candidate for admission to a university would be compared to norms based on black individuals. A good argument can be made for using appropriate norms. If we are assessing how depressed a psychiatric patient is, it would make sense to compare that person's score on the depression inventory we used with norms based on psychiatric patients. If the subject were a college student, then more appropriate norms might be based on college students. The appropriateness of norms must be determined by their relevance to a criterion. If there were no difference in criterion performance, i.e., depression, between college students and psychiatric patients, then separate norms would not be appropriate.

In effect, using racial norms prejudges the question and assumes that the minority group will do less well on the criterion. This approach is exemplified by the work of Mercer on the SOMPA (see section below). The counterargument to the issue of separate norms is that the typical child will need to function in a pluralistic setting and will be working in a world that includes people from many different cultures. In addition, cultural groups are not "pure"; there are many children who have parents of mixed ethnicity, parents who come from different geographical locations (e.g., an American father and a Vietnamese mother), and who may have different cultural and linguistic backgrounds. It is not possible to develop such specific norms, and at any rate, the criterion is not one's ethnic group but one's peers in the school system and in the world of work.

Some general conclusions. There is a considerable body of literature on test bias, particularly with the better-known tests of cognitive functioning such as the WISC-R and the Raven (for representative illustrative studies see Dean, 1980; McShane, 1980; McShane & Plas, 1982; Reschly, 1978). The literature is complex, sometimes contradictory, and sometimes it is difficult to disentangle what is factual evidence vs. subjective judgment. However, we can come to some conclusions. In general, studies of intelligence tests have shown that these tests do not have bias against blacks or other minority groups (e.g., C. R. Reynolds, 1982; C. R. Reynolds, Willson, & Chatman, 1985). More specifically, from an internal criterion point of view, studies show factorial equivalence of tests such as the WISC-R in Hispanics, blacks, and Anglos, and equivalent indices of internal consistency in these various
groups. In other words, the bulk of the evidence suggests that well-designed cognitive tests are not biased. At the same time, the evidence suggests that nonverbal cognitive measures such as the Raven show no bias, but that verbal tests may show some possible bias for Mexican-Americans; Mexican-American children, particularly those who are bilingual, do better on performance and nonverbal items than on verbal items. The language of the examiner and/or of the test seems to have some impact with Mexican-American children. For example, Mexican-American children tend to score higher on such tests as the WISC and the Stanford-Binet when these are given in Spanish rather than English.

From an external point of view, the literature indicates that standardized tests such as the WISC-R are just as predictively accurate with minority children as with Anglo children, particularly when the criterion is a standardized achievement test. The data are less convincing (and also more limited) when the criterion consists of teachers' grades. Such conclusions also apply to the testing of adults. For example, a number of studies have shown that the predictive validity of the SAT for blacks attending black colleges is as high as that for whites (e.g., Stanley & Porter, 1967).

On the basis of a literature review, Clarizio (1982) came to the following conclusions:

1. Nonverbal tests such as the Performance scale of the WISC-R provide valid measures of the intellectual functioning of Mexican-American children.

2. Verbal intelligence scales have about as good validity for Mexican-American children as they do for Anglos in predicting short-term achievement. However, because a specific language factor may depress the performance of bilingual children on verbal scales, care needs to be exercised in arriving at specific decisions.

3. Consideration should be given to testing a bilingual child in both languages, by an examiner who is fluent in both languages.

CROSS-CULTURAL ASSESSMENT

When psychological tests were first developed, particularly those of intelligence, researchers attempted to develop measures that would not be affected by differing cultural factors such as language, literacy (i.e., ability to read), test sophistication, and so on. For example, if test items could be developed that did not require language, such items could be used in a test that could be administered to groups having different languages, and such tests would then be "culture-free." Eventually, it became apparent that valid culture-free tests could not be developed. Behavior does not exist in a vacuum, and culture is not simply an outside veneer that can be discarded at will. Thus, in the measurement of intelligence there was a shift from "culture-free" tests to "culture-fair" tests that need to be evaluated and validated within each culture.

Culture-fair tests tend to be nonverbal in nature. They use items such as those depicted in Figure 11.1. These items often consist of completing patterns, classification tasks, finding one's way out of a paper maze, and so on. Such items, which are typically pictorial or performance-based rather than verbal, often involve abstract reasoning and the solution of novel problems rather than the more traditional verbal items that reflect school knowledge. Sometimes the items are selected because they are equally unfamiliar to different cultures, and sometimes they are presumed to be of equal familiarity. Instructions may be verbal, but can often be given orally, in the appropriate language, or through pantomime.

Unfortunately, establishing the validity of culture-fair tests is problematic, in part because validity must ultimately rest on criteria that are not free of culture. For example, academic achievement occurs in a school setting with all the demands, expectations, prejudices, values, etc., that clearly reflect a particular culture.
[FIGURE 11–1. Illustrations of items used in Culture-Fair Tests: completing a pattern, finding the correct way out of a maze, and selecting the correct choice.]

Problems in cross-cultural testing. There are many problems associated with cross-cultural research, and particularly with the use of specific tests in different cultures. For example, translations from one language to another may result in instruments that are not really equivalent. Psychological constructs such as depression, ego-strength, or intelligence may not necessarily be equivalent across languages and/or cultures. There may be crucial differences from one culture to another in terms of test sophistication and test-taking behavior. A test may well have content validity in one country, but not in another. Holtzman (1968) divides such potential problems into three categories: (1) cross-national differences, (2) cross-language differences, and (3) subcultural differences (such as ethnic origin and degree of urbanization). He suggests that studies can be undertaken that control for one or more of these categories. For example, a comparison of college students in Mexico, Venezuela, and Peru on the CPI would presumably control for cross-language differences – although colloquialisms and other aspects may still be different. (For a brief review of many of the measurement problems involved in cross-cultural testing, see Hui and Triandis, 1985.)

Translating from one language to another. If we wanted to determine the utility of a particular test that was developed in the United States
in another country, such as Venezuela, we would translate the test into Spanish and carry out the necessary assessments of reliability and validity in Venezuela. In fact, most of the major commercially published tests, such as the Stanford-Binet, the WISC, the MMPI, and the CPI, are available in various languages. However, translation is not a simple matter of looking up the equivalent words in a dictionary. Common objects in one culture may be uncommon in another; bland words in one language may have strong affective meanings in another. Phrases may be translated correctly in a literary sense, but not be colloquially correct; maintaining the language equivalence of an instrument across languages can be difficult.

Brislin (1970) suggested that the procedure to be used, called the back translation method, follow certain steps. The items to be translated should be simple, not use hypothetical phrasings or the subjunctive mood, and avoid metaphors and colloquialisms. The items are translated from the source language to the target language by a bilingual person. A second bilingual person translates the items back from the target language to the source language. The two source-language versions are then compared; any discrepancies hopefully can be resolved.

There is also a process known as decentering, which refers to a translation process in which both the source and the target language versions are equally important – i.e., both the source and the target versions contribute to the final set of questions, and both are open to revisions.

A number of other procedures are also used. In addition to the back translation, bilinguals can take the same test in both languages. Items that yield discrepant responses can easily be identified and altered as needed or eliminated. Rather than use only two bilingual individuals to do the back translation, several such individuals can be used, either working independently or as a committee. It is also important to pretest any translated instruments to make sure that the translated items are indeed meaningful.

Quite often, tests that are developed in one culture are not simply translated into the target language, but are changed and adapted. This is the case, for example, with the Cuban version of the Stanford-Binet (Jacobson, Prio, Ramirez, et al., 1978).

Etic and emic approaches. Two words that are used by linguists, "phonetic" and "phonemic," refer to language. Phonetic refers to the universal rules for all languages, while phonemic refers to the sounds of a particular language. From these terms, the words "etic" and "emic" were derived and used in the cross-cultural literature. Etic studies compare the same variable across cultures. For example, we might be interested in depression and might administer a depression scale to nationals of various countries. Emic studies focus on only one culture and do not attempt to compare across cultures. We might administer a depression questionnaire to a sample of Malaysian adults, for example, and determine how scores on the questionnaire are related to everyday behaviors in that culture.

A good example of an emic study is that by Laosa (1993), who studied what family characteristics are related to a child's school readiness in Chicano children. He studied normal young children in 100 two-parent Chicano households of widely varied socioeconomic levels. This was a longitudinal study, with data collected when the children were 30 months, 42 months, and 48 months of age. The data collection involved interviews with the mother, administration of the Culture Fair Intelligence Test (see discussion that follows) to both parents, and administration of a preschool achievement test to the children as a measure of school readiness. Among the many findings were that children's school readiness was related to the father's and the mother's levels of schooling (r = .54 and .46), and to the father's and mother's scores on the Culture Fair Intelligence Test (r = .52 and .32). In other words, children who were better prepared to enter school had better-educated and more intelligent parents. In particular, the results of this study suggest an important influence of Chicano fathers on their children's learning and development. For another interesting emic study, see M. King and J. King (1971), who studied more than 1,100 freshmen at a university in Ethiopia to see what variables correlated most with academic achievement.

MEASUREMENT OF ACCULTURATION

Assessing acculturation. What is meant by acculturation? Although it is a straightforward question, there is no simple answer. As Olmedo
(1979) stated, the term acculturation is one of the most elusive yet ubiquitous constructs in the behavioral sciences. For our purposes, acculturation involves all the myriad aspects and processes that impinge on an individual from one culture as that person enters a different culture for an extended period of time. The process of acculturation refers to the changes in behavior and values that occur in minority individuals as they are exposed to the mainstream cultural patterns. Acculturation can be seen as a group phenomenon, which is how anthropologists and sociologists study it, or as an individual phenomenon, which is how psychologists see it. However defined, acculturation is a multifaceted and gradual process. Concrete objects such as clothing can be adopted rather rapidly, but changes in values may take longer, or never occur. Acculturation is an important variable, not only intrinsically, but because it is related to a variety of physical and psychological conditions, including alcoholism, educational achievement, suicide, mortality rates, willingness to use counseling, and others (e.g., Leighton, 1959; Hong & Holmes, 1973; Padilla, 1980; Ruesch, Loeb, & Jacobson, 1948; R. Sanchez & Atkinson, 1983).

There seems to be substantial agreement in the literature that the major dimension underlying acculturation and acculturation scales is language use. If a person speaks Spanish but little or no English, then that person is assumed not to be acculturated. Another major dimension underlying acculturation is generational distance, defined by where the respondent and family members were born. A first-generation person is someone whose parents, grandparents, and self were born outside the United States, while a second-generation person was born in the United States with both parents and grandparents born outside of the United States. (Third-generation Mexican-Americans are assumed to be more acculturated than second generation, who in turn are more acculturated than first.) Generational distance is often used as a criterion for determining the construct validity of acculturation scales.

Acculturation scales typically use one or more of three types of items: linguistic, psychological, and sociocultural. Almost all scales use linguistic items, having to do with language use, proficiency, and preference. Psychological items deal with values, attitudes, knowledge, and behavior. Sociocultural items typically cover occupational status, educational level, family size, degree of urbanization, and similar aspects.

A number of scales to measure acculturation have been developed, primarily for Hispanics (e.g., Cuellar, Harris, & Jasso, 1980; Deyo, Diehl, Hazuda, et al., 1985; Franco, 1983; Mendoza, 1989; Olmedo, J. L. Martinez, & S. R. Martinez, 1978; Olmedo & Padilla, 1978; Ramirez, Garza, & Cox, 1980; Triandis, Kashima, Hui, et al., 1982); a smaller number of scales have been developed for Asians (Suinn, Ahuna, & Khoo, 1992; Suinn, Rickard-Figueroa, Lew, et al., 1987). A few investigators have developed scales to measure specific subgroups within the broader labels, such as Cubans. Szapocznik, Scopetta, Aranalde, et al. (1978) developed a 24-item behavioral-acculturation scale for Cubans, with a high degree of internal reliability (alpha = .97) and test-retest stability over a 4-week period (r = .96). Garcia and Lega (1979) developed an 8-item Cuban Behavioral Identity Questionnaire; one interesting but somewhat burdensome aspect of this scale is that the responses, given on a 7-point Likert format, are placed in a regression equation to obtain a total score. Most of these scales are applicable to adolescents and adults, but some have been developed specifically for children (e.g., Martinez, Norman, & Delaney, 1984).

Most acculturation scales assume acculturation to be a unipolar variable, i.e., a person is more or less acculturated. Some investigators see acculturation as a bipolar variable – a person can also be bicultural, equally at ease in both cultures. For example, Szapocznik, Kurtines, and Fernandez (1980) developed separate scales to measure "Hispanicism" and "Americanism" in an attempt to measure biculturalism, the degree to which the person identifies with both cultures.

The Marin scale. A representative example of an acculturation scale for Hispanics is the one developed by G. Marin (G. Marin, Sabogal, & B. V. Marin, 1987). The authors selected 17 behavioral-acculturation items from previously published acculturation scales. These items measured proficiency and preference for speaking a given language in various settings (e.g., what
language(s) do you read and speak: (a) only Spanish, (b) Spanish better than English, (c) both equally, (d) English better than Spanish, (e) only English. Other questions with similar response options ask what language(s) are spoken at home, with friends, as a child, etc.). The items were placed within the context of a 16-page questionnaire and administered to a sample of 363 Hispanics and 228 non-Hispanic whites. Both English and Spanish versions were available.

The responses for the two samples were factor analyzed separately. For the Hispanic sample three factors were obtained. The first accounted for 54.5% of the variance and was called "language use and ethnic loyalty." This factor was made up of seven items that measured language use and the ethnicity of important others. The second factor accounted for only 7% of the variance and included four items on preference for media (e.g., Spanish-language TV). The third factor accounted for 6.1% of the variance and included four items that measured the ethnicity of friends for self and for one's children. Similar results were obtained for the non-Hispanic white sample, except that the second and third factors accounted for a greater portion of the variance. On the basis of some additional statistical decisions, the authors chose 12 items as their final scale. The 12-item scale was then analyzed for reliability (alpha coefficient = .92) and for validity. Scores on the scale correlated significantly with generational distance, with length of residence in the United States (taking age into account), and with the respondents' own evaluation of their degree of acculturation.
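The alpha coefficient reported for the 12-item scale is the internal-consistency index discussed in Chapter 3, and it can be computed directly from item responses. A minimal Python sketch, using invented responses to a short Likert-type scale:

```python
import numpy as np

def cronbach_alpha(scores):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance
    of total scores), where k is the number of items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented responses of six people to a four-item Likert-type scale.
responses = [[5, 4, 5, 4],
             [3, 3, 2, 3],
             [4, 4, 4, 5],
             [2, 1, 2, 2],
             [4, 5, 4, 4],
             [1, 2, 1, 2]]
print(round(cronbach_alpha(responses), 2))
```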
The Olmedo, Martinez, & Martinez (1978) Scale. This is a two-part paper-and-pencil inventory that started with 127 items. The first part consists of a semantic differential in which four concepts (mother, father, male, and female) are rated on a set of 15 bipolar adjectives, all reflective of a potency dimension (e.g., hard-soft, weak-strong). The second part consists of 18 items that cover background information such as gender, place of birth, family size, and language spoken at home. The scale was originally administered to some 924 high-school students, of whom about 27% were Chicanos. A series of analyses yielded a set of 20 variables (9 semantic and 11 sociocultural) that were correlated with ethnicity. The most significant item was "only English spoken at home," with affirmatively responding subjects likely to be Anglo. Conversely, those endorsing the item "mostly Spanish spoken at home" were likely to be Chicano.

Test-retest reliability over a 2- to 3-week period for a group of 129 junior college students was reported to be .89 for Chicanos and .66 for Anglos. A factor analysis of the 20 items indicated 3 factors: factor I was labeled a Nationality-Language factor, factor II a socioeconomic status factor, and factor III a semantic factor. This scale has been used in a number of studies, with college students (Padilla, Olmedo, & Loya, 1982) and with community adults (Kranau, Green, & Valencia-Weber, 1982; Olmedo & Padilla, 1978). The scale has been cross-validated (Olmedo & Padilla, 1978), and a Spanish version developed (Cortese & Smyth, 1979).

The ARSMA. Another popular acculturation scale is the Acculturation Rating Scale for Mexican Americans, or ARSMA (Cuellar, Harris, & Jasso, 1980). The ARSMA consists of 20 questions, each scored on a 5-point scale ranging from Mexican/Spanish to Anglo/English. For example, one item asks what language you prefer. Available responses are (a) Spanish only; (b) mostly Spanish, some English; (c) Spanish and English about equally; (d) mostly English, some Spanish; (e) English only. The total scores on this scale yield a typology of five types: very Mexican; Mexican-oriented bicultural; "true" bicultural; Anglo-oriented bicultural; and very Anglicized. Four factors have been identified on the scale: (1) language preference; (2) ethnic identity and generation removed from Mexico; (3) ethnicity of friends and associates; and (4) direct contact with Mexico and ability to read and write in Spanish. Both internal reliability (alpha = .88) and test-retest reliability (.80 for a 4- to 5-week period) are adequate. This scale and its factors were cross-validated (Montgomery & Orozco, 1984), and the scale has been used in a variety of studies (e.g., Castro, Furth, & Karlow, 1984). Some of the items have been incorporated into a semistructured interview measure of acculturation (Burnam, Telles, Karno, et al., 1987).
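Scoring schemes of this kind reduce to averaging the 1-to-5 item responses and mapping the mean onto the typology. The Python sketch below illustrates only the mechanics; the cutoff values are hypothetical placeholders, since the published ARSMA cutoffs are not reproduced here.

```python
def arsma_type(responses):
    """Map the mean of twenty 5-point item responses onto the five-type
    typology.  The cutoffs below are hypothetical placeholders, not the
    published ARSMA values."""
    mean = sum(responses) / len(responses)
    if mean < 1.8:
        return "very Mexican"
    if mean < 2.6:
        return "Mexican-oriented bicultural"
    if mean < 3.4:
        return "'true' bicultural"
    if mean < 4.2:
        return "Anglo-oriented bicultural"
    return "very Anglicized"

print(arsma_type([3, 2, 3, 4, 3] * 4))  # twenty invented responses
```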
SL-ASIA. Most of the acculturation scales that have been developed are for Hispanics. One scale
that was developed for Asians, but modeled on the ARSMA, is the Suinn-Lew Asian Self-Identity Acculturation Scale, or SL-ASIA (Suinn, Rickard-Figueroa, Lew, et al., 1987). The SL-ASIA consists of 21 multiple-choice items, written to mimic the ARSMA items and covering such aspects as language preference, identity, and generational background. Suinn, Ahuna, and Khoo (1992) administered the SL-ASIA and a demographic questionnaire to a sample of 284 Asian-American college students, with a mean age of 24.4 years. Internal consistency estimates across several studies range from .88 to .91 (Atkinson & Gim, 1989; Suinn, Ahuna, & Khoo, 1992; Suinn, Rickard-Figueroa, Lew, et al., 1987). Scores on the SL-ASIA correlated significantly with such demographic variables as total years living in the United States (r = .56), total years attending school in the United States (r = .61), years lived in a non-Asian neighborhood (r = .41), and self-ratings of acculturation (r = .62). A factor analysis indicated five factors, three of which were identical to those found on the ARSMA. These factors were: (a) reading/writing/cultural preference (accounting for 41.5% of the variance), (b) ethnic interaction (10.7% of the variance), and (c) generational identity (5.9% of the variance). The two additional factors were: (d) affinity for ethnic identity and pride (6.6% of the variance), and (e) food preference (5%). These 5 factors involve 17 items, so each factor is made up of 2 to 5 items. Again, the one major factor involves language.
Stanford-Binet.
On the other hand, there is a substantial body
SOME CULTURE-FAIR TESTS
of literature that suggests that culture-fair tests
AND FINDINGS
such as the Cattell fulfill not only theoretical and
social concerns but practical needs as well. For
The Cattell Culture-Fair Intelligence Test
example, juvenile courts often require screening
This test was first published in 1944 and was one of intellectual ability in a population that is over-
of the first attempts to develop an intelligence represented with minority groups. Smith, Hays,
measure free of cultural influences. The test was and Solway (1977) compared the Cattell Culture-
presumed to be a measure of “g” and reflect R. B. Fair Test and the WISC-R in a sample of juve-
Cattell’s theory of fluid intelligence and crystal- nile delinquents, 53% of whom were black or
lized intelligence. Fluid intelligence is made up of Mexican-American. The results indicated signif-
abilities that are nonverbal, that do not depend on icant ethnic differences, with whites scoring 17.9
specific exposure to school or other experiences, points higher on the WISC-R and 11.4 points
and therefore are relatively culture free; basically, higher on the Cattell than their minority peers.
fluid intelligence is general mental capacity for Scores on the Cattell correlated .76 with the
problem solving, especially in novel situations. WISC-R Full Scale IQ, .71 with the Verbal IQ,
Crystallized intelligence refers to acquired skills and .70 with the Performance IQ. The authors
The authors concluded that the Cattell is a better measure of intelligence for minority groups than the WISC-R, as it lessens the effect of cultural bias and presents a "more accurate" picture of their intellectual capacity.

Raven's Progressive Matrices

The Raven's PM consist of a series of three tests: the Standard Progressive Matrices (SPM; J. C. Raven, 1938), the Coloured Progressive Matrices (CPM; J. C. Raven, 1947a), and the Advanced Progressive Matrices (APM; J. C. Raven, 1947b). These tests are based on Spearman's two-factor theory, which distinguished between g, or general intelligence, and s, or specific factors. Spearman's theory also distinguished between "eductive" and "reproductive" thinking processes, and the Progressive Matrices are designed to assess a person's ability to educe relationships rather than reproduce learned material. We might prefer to use the term inductive reasoning and consider the Raven's as such measures, in that the examinee is presented with a collection of elements, needs to infer a rule or rules that relate such a collection, and then needs to verify the rule by selecting an appropriate new element that fits the rule(s). Others might call this "analytical" intelligence, the ability to deal with novelty, to adapt one's thinking to a new cognitive problem (Carpenter, Just, & Shell, 1990), while still others might prefer to use Cattell's term of "fluid intelligence."

Each of the Raven's PM yields only one score, namely the number of items answered correctly. These tests can be used with children, adolescents, and adults, although the evidence suggests that the PM are of limited value and questionable reliability for children under age 7. In general, there do not seem to be gender differences on any of the three PM (Court, 1983). The Raven's PM have achieved a high degree of popularity and have been used in well over 1,600 published studies (Court, 1988). In particular, the Raven is used with groups such as children and the elderly, for whom language processing may need to be kept at a minimum. As with many of the other popular tests, short forms have also been developed (e.g., W. Arthur & Day, 1994; Wytek, Opgenoorth, & Presslich, 1984).

Because of its relationship to the concept of g, there is the temptation to consider that whatever the Raven's measures is immutable and hereditary, but the evidence is quite to the contrary. For example, Irvine (1969) reports on a series of studies of eighth and tenth graders in Central Africa. He reports that when special introductory procedures were used, involving the teaching of sample problems, there was a decrease in the variance of the scores and an increase in the mean. There were also differences in mean scores between specific schools, and in item difficulty levels among specific ethnic groups.

Despite the popularity of the Raven's PM, they have been criticized rather substantially. For example, Bortner (1965) indicated that the diagnostic value of a test of cognitive functioning comes from an analysis of the errors made by the subject, and from observations of the subject's attempts to deal with the task. The PM do not allow for such analyses and observations and hence are of limited value.

The Standard PM (SPM). The SPM (J. Raven, J. C. Raven, & Court, 1998) is probably the most widely used of the three progressive matrices tests; it consists of 60 problems in 5 sets of 12. The tests are called progressive because each problem in a set, and each set, is progressively more difficult. Each problem consists of a geometric design with a missing piece; the respondent selects the missing piece from six or eight choices given.

Originally, the SPM was developed together with a vocabulary test (the Mill Hill Vocabulary Scale) to assess the two components of general intelligence as identified by Spearman; the SPM was a measure of eductive ability, while the Mill Hill was a measure of reproductive ability. The two scales correlate about .50, suggesting that the two measures are somewhat distinct. While the Mill Hill Vocabulary Scale is widely known and used in England, it is practically unknown in the United States.

The SPM is untimed and can be group-administered. The test was originally standardized in 1938 and restandardized in 1979 on British school children, with norms for ages 6 to 65. The reliability of the SPM is quite solid. Split-half reliabilities are in the .80s and .90s, with median coefficients hovering near .90.
For example, H. R. Burke and Bingham (1969) reported a corrected split-half reliability of .96. Similarly, test-retest reliabilities range from the .70s to the .90s, with median coefficients in the .80s. Extensive norms are available for the SPM, both for English-speaking groups from countries such as Canada, England, Ireland, and the United States, and for other countries such as China and the former Czechoslovakia. For example, H. R. Burke (1972) gives norms based on a sample of 567 male American veterans that allow one to change the SPM score into an estimated WAIS IQ. There is a substantial cross-cultural literature on the SPM (e.g., Abdel-Khalek, 1988; A. Moran, 1986).

Much of the concern about the validity of the SPM centers on Spearman's theory. In the United States, the concept of g has never been popular, with American theorists preferring a multifactor approach, that is, theories that suggest intelligence is made up of a number of separate dimensions. In England and other countries, Spearman's theory has had greater impact and acceptance. A number of studies that have factor-analyzed SPM data have indeed obtained one major factor, although some studies have found a variety of other factors.

Concurrent validity of the SPM with standard tests of intelligence such as the Stanford-Binet or the WISC shows correlations that range from the .50s to the .80s. Predictive validity, especially predicting academic achievement, is generally low. The test appears to be a "culturally fair" measure and not to have gender bias, although the results of different studies are contradictory.

S. Powers and Barkan (1986) administered the SPM to 99 Hispanic and 93 non-Hispanic seventh graders and compared their scores with the scores on a standardized norm-referenced achievement test (California Achievement Test). SPM scores correlated .40 with reading achievement scores, .45 with language achievement, and .49 with mathematics achievement, with no significant differences between Hispanic and non-Hispanic students in the magnitude or pattern of correlations.

Among the criticisms leveled at the SPM is that it tends to overestimate IQ compared with tests such as the Wechsler and has a restricted ceiling; that is, it may be too easy for a number of subjects (Vincent & Cox, 1974).

The Coloured PM (CPM). The CPM was designed for use with young children aged 5 to 11, with the mentally handicapped, and with the elderly. The CPM contains 36 problems printed in different colors, but is the same as the SPM in other respects. In fact, two subsets of items are identical with those found on the SPM (except for the color). There is also a vocabulary test (the Crichton Vocabulary Scale) to be used in conjunction with the CPM, but it is rarely used, at least in the United States.

The reliability of the CPM also is adequate, with split-half coefficients between .82 and .99, and test-retest coefficients between .67 and .86 (Court & J. C. Raven, 1982). Concurrent validity with tests such as the Stanford-Binet and the WISC varies widely, with correlation coefficients ranging from the .20s to the .80s. In general, higher correlations are found with older children (ages 11 and 12). How important is the color aspect? Tuddenham, Davis, Davison, et al. (1958) reproduced the items in black and white on throw-away sheets, rather than using the more expensive color-printed reusable booklets of the commercial version, and obtained identical results to those obtained under standard administration. The CPM has been criticized for unrepresentative norms, but it is considered a useful instrument, particularly in the assessment of minority children. In one study of Chicano and Anglo children (Valencia, 1979), the investigator reported that when socioeconomic status and language were held constant, Chicano children did almost as well as Anglos.

In another study (J. S. Carlson & C. M. Jensen, 1981), the CPM was administered individually to some 783 children aged 5½ to 8½; these children included 301 Anglo, 203 black, and 279 Hispanic. For all children together, the alpha reliability was .82, the K-R reliability was .82, and the corrected split-half reliability was .85. However, for the youngest children (5½ to 6½) these coefficients were .57, .64, and .65, respectively. When the results were analyzed by ethnicity, the authors concluded that the CPM appeared to be equally reliable for all three ethnic groups. However, the coefficients they report are as follows: for Anglos, .83, .83, and .87; for Hispanics, .76, .76, and .77; and for blacks, .76, .76, and .81, suggesting slightly lower reliability for non-whites.
FIGURE 11–2. An example of a D–48 item. (Solution: the blank item should be zero and 6.)

The Advanced PM (APM). The APM consists of 48 items similar to those on the SPM, but considerably more difficult; the test is intended for children aged 11 and older of above-average intellectual ability. The test consists of two sets of items. Set 1 can be used either as a practice test for Set 2 or as a rough screening test. Set 2 can be used either as a test of "intellectual capacity" when administered without a time limit or as a test of "mental efficiency" when used with a time limit. Although the APM has been used in a number of studies (e.g., Ackerman, 1992), there seems to be relatively little psychometric information available. What is available suggests that the APM is much like the other two forms in terms of reliability and validity. One major difference between the APM and the other two is the lack of norms. The test manual reports "estimated" norms rather than actual norms. Another possible difference concerns the dimensionality of this form. Because the underlying theory is that of Spearman, a construct validity approach would require the PM to be unidimensional, i.e., a "pure" measure of g. That seems to be the case for the SPM and the CPM, but there seems to be some question about the APM: some studies have reported a single factor (e.g., Alderton & Larson, 1990; Arthur & Woehr, 1993), while others have reported two (Dillon, Pohlmann, & Lohman, 1981). A 12-item short form is also available (Arthur & Day, 1994).

The D–48. Like the Progressive Matrices, the D–48 began its life in the British Army, during the Second World War, as a parallel test to the PM. This test, in various forms, is used widely in England and in various South American countries, but is less well known in the United States. The D–48 is a nonverbal-analogies test also designed to measure g. It consists of 48 sequences of dominoes (4 are used as practice examples) in which the subject must determine the pattern and/or sequence and fill in the blank item. Figure 11.2 illustrates an item.

In the first study to appear in the U.S. literature, the D–48 was administered to 86 fifth- and sixth-grade students, and their scores were compared to grades and to achievement test scores (G. Gough & Domino, 1963). D–48 scores correlated .58 and .45 with grades in the fifth- and sixth-grade classes. Scores on the D–48 also correlated significantly with achievement test scores, from a low of .27 to a high of .51. In both groups of children, scores on the D–48 correlated more highly with grades than did the scores on the achievement tests, and in the sixth-grade sample scores on the D–48 were a better predictor of grades than the grades those children had obtained the prior year.

As we discussed above, in assessing test bias one concern is the difficulty level of the items across various groups. Gough and G. Domino (1963) calculated the difficulty level of each of the items on the D–48 for their sample of children and changed these indices to ranks (the easiest item was given a rank of 1, etc.). They then compared these ranks with those obtained by other investigators. For example, they obtained a correlation of +.83 between the ranks for these American children and the ranks for a sample of Italian university students; .89 with ranks for a sample of French college students; .95 with those for a sample of Lebanese men; and .91 with ranks for a sample of Flemish children. Thus the relative difficulty level of the D–48 items is rather constant for different age groups and for testing in different countries and in different languages.
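
The comparison just described is easy to reproduce: compute the proportion of examinees passing each item in each sample, convert those proportions to ranks, and correlate the two sets of ranks. The Python sketch below uses fabricated difficulty values for eight items; correlating the two rank vectors yields a Spearman-type rank-order coefficient:

    import statistics

    def difficulty_ranks(prop_passing):
        """Rank items from easiest (rank 1) to hardest; ties not averaged."""
        order = sorted(range(len(prop_passing)), key=lambda i: -prop_passing[i])
        r = [0] * len(prop_passing)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    # Fabricated proportion-passing values for the same 8 items in two samples
    sample_a = [.92, .85, .77, .70, .58, .44, .30, .15]
    sample_b = [.89, .88, .71, .74, .50, .47, .28, .20]

    rho = statistics.correlation(difficulty_ranks(sample_a),
                                 difficulty_ranks(sample_b))
    print(round(rho, 2))  # high values indicate a similar item-difficulty order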
Most of the validity information for the D–48 is of the concurrent type. For example, Welsh (1966) administered the D–48, the Terman Concept Mastery Test (a verbal intelligence test), and a measure of creativity to gifted high-school students. Scores on the D–48 correlated .49 with scores on the Terman (note that this is just about identical with the correlation of the Raven with a vocabulary test), and did not correlate with the measure of creativity. Boyd and Ward (1967) administered the D–48, the Raven PM, and a group intelligence test to a sample of college students. D–48 scores correlated .20 with GPA, .39 with the Raven's, and .57 with the intelligence test. The D–48 did slightly better than the Raven's in predicting GPA, but not as well as the group intelligence test.

These results are in accord with what might be called differentiation theories of cognitive functioning. These theories postulate that cognitive abilities become more differentiated as the child matures. At the preschool level the notion of g appears sensible. As cognition develops with age, g becomes differentiated into specific abilities. Thus, at the younger school grades we would expect a measure of g to correlate more substantially with achievement than at the upper school levels. Such a view is indeed supported by factor-analytic studies of instruments such as the Stanford-Binet (Thorndike, Hagen, & Sattler, 1986a).

Only future research can indicate how useful this test might or might not be, but Gough and G. Domino (1963) list five reasons to pay attention to the D–48: (1) American psychologists should be aware of what psychologists in other countries are doing; (2) the D–48 uses stimuli (domino sequences) that are familiar to people of most cultures; (3) the D–48 is almost entirely nonverbal; (4) the British literature suggests that the D–48 is more highly loaded on g than even the Raven's; and (5) the D–48 is easy to administer and score, requires a brief time period (30 minutes), and can be administered individually or in groups.

The System of Multicultural Pluralistic Assessment (SOMPA)

The SOMPA (J. R. Mercer, 1976; J. Mercer & Lewis, 1977) is really a test battery designed to assess children from culturally different backgrounds. It is composed of six tests to be administered to the child and three instruments administered to the parent. These tests were selected to reflect a tripartite model that includes a medical perspective, a social system perspective, and a pluralistic perspective.

There are six measures on the SOMPA that reflect the medical perspective. These include: (1) the Physical Dexterity Tasks (behaviors used by physicians to assess sensorimotor coordination), which are used to screen for possible neurological or central nervous system anomalies; (2) the Bender Visual Motor Gestalt Test (see Chapter 15), a widely used measure of perceptual and neurological aspects, which involves the copying of a series of designs; (3) the Health History Inventory, a series of questions about the child's birth and medical history (asked of the parent); (4) weight; (5) height; and (6) visual acuity, as measured by the Snellen charts. These last three measures are usually available in nursing records. Basically, the intent of all these tasks is to identify potential medical and physical problems.

The social system perspective fundamentally focuses on social deviance, which is defined as specific behaviors in terms of norms appropriate to a specific group. There are two instruments that reflect this perspective. One is the Adaptive Behavior Inventory for Children (ABIC), which was developed on the basis of extensive interviews with mothers of children from the Anglo, black, and Hispanic cultures. The parent is asked to evaluate the child's competence in six areas such as family role and student role; the intent here is to obtain an estimate of how the child meets specific cultural demands for independent functioning, as judged by a parent. The test is said to be culturally fair because ethnic and gender differences in mean scores are negligible. The second test under this perspective is the WISC-R. However, the test is not used as a measure of ability, but as a measure of functioning in school.

The pluralistic perspective uses four sociocultural scales to determine how much an individual's world differs from the Anglo core culture. These scales assess socioeconomic status, degree of Anglo cultural assimilation, and degree of integration into Anglo social systems. Multiple regression procedures are then used to predict a normal distribution of WISC-R scores for a particular sociocultural group. A child's relative standing then provides an index of "estimated learning potential." Note, then, that scores on the WISC-R are not compared with test norms, but with specific sociocultural norms, so that a child's performance is compared with that of members of his or her own sociocultural or ethnic group; that is what the term "pluralistic" in the title of the SOMPA refers to.
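
In schematic form, the pluralistic logic is: regress WISC-R scores on the sociocultural scales within the child's own group, and then express the child's obtained score relative to the distribution expected for children with the same sociocultural profile. The Python sketch below compresses this to a single predictor with fabricated data; it illustrates the arithmetic only and is in no way the SOMPA's actual equations or norms:

    import statistics

    # Fabricated within-group data: sociocultural scale score -> WISC-R score
    socio = [22, 35, 41, 48, 55, 60, 68, 74]
    wisc = [88, 92, 90, 97, 101, 99, 108, 112]

    # Least-squares line predicting WISC-R from the sociocultural scale
    slope, intercept = statistics.linear_regression(socio, wisc)  # Python 3.10+
    predicted = [intercept + slope * x for x in socio]
    residual_sd = statistics.pstdev([w - p for w, p in zip(wisc, predicted)])

    # A child's standing relative to his or her own sociocultural group
    child_socio, child_wisc = 40, 104
    expected = intercept + slope * child_socio
    z = (child_wisc - expected) / residual_sd
    print(f"expected {expected:.1f}, obtained {child_wisc}, z = {z:+.2f}")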
What is unique about the SOMPA – and a source of controversy – is the theoretical rationale, the way norms are used, and the structure of the battery. The SOMPA comes out of a pluralistic view of American society; that is, American society is composed of a dominant Anglo core culture as well as many identifiable unique cultural groups that vary in their degree of identification with Anglo culture. The argument is then made that the more distinct and homogeneous the ethnic group, the greater the difference in the life experiences of children from this group vs. the Anglo group, and so the ethnic child needs to be assessed with norms that are "appropriate," that is, reflective of the child's sociocultural group (Figueroa, 1979). The assessment devices that make up the SOMPA were selected on the basis of the social and educational philosophy held by the author. As Humphreys (1985) indicates, the assumptions that underlie this rationale reflect values rather than psychometric principles.

The SOMPA was designed to assess the cognitive abilities, perceptual-motor skills, and adaptive behaviors of children aged 5 to 11, with a particular emphasis on identifying deficits such as mental retardation. The standardization group consisted of 2,100 California children, aged 5 through 11, representing black, Anglo, and Hispanic children. In part, the intent of the SOMPA was to separate the effects of physical and sociocultural aspects and to assess cognitive potential by taking such circumstances into account. The SOMPA was greeted with much anticipation, perhaps too much, but the criticisms of this battery and its approach have been many, and they indicate that its usefulness is quite limited (cf. F. Brown, 1979; Humphreys, 1985; C. R. Reynolds, 1985; Sandoval, 1985).

If we accept the notion that a significant mean difference between ethnic groups on a test is reflective of test bias, then there are a number of options. Of course, one is to label tests as racist and not to use them. Another is to equate test performance. Such equating can be done in a variety of ways, from assigning extra points to the scores of minority members to the use of separate norms. The SOMPA system in fact provides a corrected estimate of intellectual abilities, labeled the estimated learning potential, by providing separate norms based on ethnicity (black, Hispanic, and white) and various cultural characteristics such as family structure.

The SOMPA has been criticized on a number of grounds, ranging from specific psychometric issues such as an inadequate standardization sample, low test-retest correlations of some of the components, and lack of validity data, to issues having to do with the assumptions underlying the approach, such as the concept of "innate ability" (that lurking within each of us is a "true IQ" whose expression is hampered by poverty, ethnic membership, parental example, and so on). The Hispanic group that was part of the normative sample was 95% Mexican-American children; in other parts of the United States, Hispanic may refer to children of Puerto Rican, Cuban, or South American descent. The data obtained on the "medical" scales are not normally distributed, but the scoring procedures assume such normality (F. Brown, 1979). Information on many of the psychometric aspects, such as the reliability of the sociocultural scales, is lacking. There are also practical problems – the battery of tests is extensive and may require more time and knowledge than the typical school psychologist may have. Many of the SOMPA scales use a transformation of raw scores to a mean of 50 and a SD of 15, rather than the more usual T scores or IQ scores. C. R. Reynolds (1985) concludes that the SOMPA was an innovative effort, but that its conceptual, technical, and practical problems are too great. (For a sober review of the SOMPA see F. Brown, 1979; for a review by several authors and a rebuttal of their criticisms see the Winter 1979 issue of the School Psychology Digest.)
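
These competing score metrics (a mean-50/SD-15 scale, T scores with mean 50 and SD 10, deviation IQs with mean 100 and SD 15) are all linear transformations of the same underlying z score, as the brief Python sketch below makes explicit (the values shown are illustrative only):

    def to_standard_score(z, mean, sd):
        """Convert a z score to an arbitrary standard-score metric."""
        return mean + z * sd

    z = 1.0  # one standard deviation above the reference-group mean
    print(to_standard_score(z, 50, 15))   # SOMPA-style scale: 65.0
    print(to_standard_score(z, 50, 10))   # T score: 60.0
    print(to_standard_score(z, 100, 15))  # deviation-IQ metric: 115.0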

Culture-Specific Tests

One reaction to the difficult issues of culture-fair tests has been to develop tests that are culture-specific. Perhaps the best known example is that of the Black Intelligence Test of Cultural Homogeneity, abbreviated as BITCH-100 (R. L. Williams, 1975). This test is composed of 100 multiple-choice items that assess knowledge about black culture and "street wiseness." The test was actually standardized on black high-school students. In one of the few studies designed to assess the validity of the BITCH-100, the test and the WAIS were administered to 116 white and 17 black applicants to a police department (Matarazzo & Wiens, 1977). As expected, the black applicants performed better on the BITCH-100 and less well on the WAIS than the white applicants. Scores on the two tests did not correlate, and the mean BITCH-100 score for the black applicants fell below the average of the standardization sample of high-school students, despite the fact that these police applicants were better educated and more intelligent as a group than the standardization sample. It would seem that the BITCH-100 is not a valid measure of intelligence, even within the black population, and may not even be a valid measure of street wiseness or knowledge of black slang (A. R. Jensen, 1980). In general, such culture-specific tests do yield higher mean scores for members of that minority, but they have little if any predictive validity. From the viewpoint of understanding, they can yield useful, if restricted, information. From the viewpoint of prediction, for example of scholastic attainment, they are typically not valid.

The TEMAS

An example of a more useful test is the TEMAS (the initials of "tell-me-a-story," and also the word for "themes" in Spanish; Costantino, Malgady, & Rogler, 1988; Costantino, Malgady, Rogler, & Tsui, 1988; Malgady, Costantino, & Rogler, 1984). The TEMAS is a projective technique based on the Thematic Apperception Test (see Chapter 15) and is composed of 23 chromatic pictures that depict black and Hispanic characters in interpersonal situations where there is some psychological conflict. The pictures were chosen to elicit themes such as aggression, sexual identity, and moral judgment. There is also a nonminority version of the TEMAS with white characters (R. Flanagan & DiGiuseppe, 1999).

STANDARDIZED TESTS

We should not lose sight of the fact that there is a rather substantial body of literature that supports the utility of standard tests such as the Stanford-Binet and the WISC-R with minority students. A typical example is the study by D. L. Johnson and McGowan (1984), who evaluated the school performance of low-income Mexican-American children. These children were tested with the Stanford-Binet at age 3, with the Stanford-Binet and the McCarthy Scales of Children's Abilities at age 4, and with the WISC-R at ages 7 to 9. These tests were administered by bilingual examiners using Spanish versions developed through back-translation. Scores on these tests were then compared with measures of school performance in first or second grade; these measures included school grades and scores on the Iowa Tests of Basic Skills. Scores on the Stanford-Binet at age 3 correlated significantly with both grades (correlation coefficients ranging from .27 to .34) and Iowa Tests results (correlation coefficients ranging from .23 to .34). The Stanford-Binet given at age 4 showed a somewhat similar pattern. Neither the McCarthy Scales nor the WISC-R correlated significantly with school grades, but both did correlate significantly with the Iowa measures of reading and of mathematics, and less so with the Iowa language measure. In general, there is a modest relationship between assessment of general intellectual functioning and evidences of that functioning in specific areas in school, regardless of the child's ethnic background. Keep in mind that school grades, especially at the lower grades, reflect much more than simply cognitive functioning. (For a review of selected studies on the WISC-R, see A. S. Kaufman, 1979a.)

College Entrance Tests

Much of the discussion so far has centered on assessment instruments used with children. What about adolescents, specifically those high-school seniors who take the Scholastic Aptitude Test (see Chapter 13) as part of the application procedure for college? Most studies of Mexican-American or Puerto Rican students (e.g., Astin, 1982; R. D. Goldman & Hewitt, 1976; Pennock-Roman, 1990) have shown that there is a difference of about 80 points on the
SAT between minority and non-Hispanic whites (recall that the SAT scores are on a measurement scale where 500 is the mean and 100 is the SD); even when social, academic, or other variables are held constant, a smaller discrepancy continues to exist. Yet most of the evidence also suggests that the SAT has equal predictive validity for both groups and is, therefore, not biased (e.g., Duran, 1983; Pennock-Roman, 1990). The lower scores are assumed to reflect the more limited English verbal skills and the poorer academic status of Hispanics (Pearson, 1993).

At the same time, a convincing argument can be made against the use of the SAT. Pearson (1993) studied Hispanic and non-Hispanic white students attending the University of Miami, a private, medium-sized university. These Hispanic students, most of whom came from middle- and upper-middle-class professional families, were primarily U.S. born and of Cuban ancestry. Academically, they did as well as their non-Hispanic white peers. Despite equivalent college grades, the SAT means for the Hispanic sample were 42 points lower on the verbal portion and 49 points lower on the math portion. Thus poorer test performance was not related to poorer academic performance, at least in this sample. Pearson (1993) postulated that the difference in SAT scores may be related to the information processing required when one knows two languages, or it may be related to working under a time limit. The data she presents suggest that the SAT contributes very little to the prediction of college GPA, compared to high-school grades and biographical material – a conclusion supported by other studies (e.g., Crouse, 1985; R. D. Goldman & Widawski, 1976) (see Chapter 13).

Another illustrative example is a study by McCormack (1983), who examined two successive entering classes at a large urban state university, each class totaling more than 2,400 students. First-semester GPA was predicted from high-school GPA and from a SAT composite of verbal + mathematics divided by 100. The following regression equation was developed on white students in one class:

Predicted GPA = high-school GPA (.73555) + SAT composite (.08050) − .67447

Table 11–1. Comparison of Predicted GPA Using Regression Equation vs. Actual GPA (McCormack, 1983)

Group               First cohort    Second cohort
White                   .38             .42
Asian                   .57             .54
Hispanic                .35             .52
Black                   .36             .40
American-Indian         .40             .42

Note that the high-school GPA has a greater weight (.73) than the SAT composite (.08), and that the last number is merely a constant needed for the numbers to come out on the GPA scale of 4 to 0.

An analysis of the actual data showed that white students obtained higher SAT scores and higher mean GPA than Hispanic, black, and Asian students, although the differences were quite small, and only the differences with the black students were statistically significant. Using the regression equation, the author found correlations of predicted GPA with actual GPA as indicated in Table 11.1. These results indicate that the majority equation overpredicted (by a very small amount) the GPA of black, Hispanic, and Asian students in both cohorts, but that the predictive accuracy of the equation was moderately low in all groups, both majority and minority. The use of first-semester GPA as the criterion severely limits this study. First-semester college GPA is relatively unstable and may reflect, to a greater degree than any other semester, problems of adjustment to a relatively novel environment; a much better criterion would have been the four-year cumulative GPA.
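
Applying McCormack's published equation, and checking it against actual grades in the manner of Table 11-1, takes only a few lines of code. In the Python sketch below, the applicants and the predicted/actual pairs are invented for illustration:

    import statistics

    def predicted_gpa(hs_gpa, sat_verbal, sat_math):
        """McCormack's (1983) equation; SAT composite = (verbal + math) / 100."""
        composite = (sat_verbal + sat_math) / 100
        return hs_gpa * 0.73555 + composite * 0.08050 - 0.67447

    # Two invented applicants
    print(round(predicted_gpa(3.5, 520, 560), 2))  # about 2.77
    print(round(predicted_gpa(2.8, 450, 430), 2))  # about 2.09

    # Cross-validation logic of Table 11-1: correlate predicted with actual GPA
    predicted = [2.77, 2.09, 3.05, 2.54]  # fabricated values
    actual = [2.50, 2.30, 3.20, 2.20]     # fabricated values
    print(round(statistics.correlation(predicted, actual), 2))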

The Graduate Record Examination (GRE)

Similar concerns can be expressed about the GRE (see Chapter 13), which has been used in graduate-school admission decisions for many years, despite substantial criticisms and low correlations between GRE scores and graduate-school performance. In 1988, the Educational Testing Service, the company that publishes the GRE, reviewed 492 validity studies that had been done on the GRE between 1983 and 1988. Correlations between GRE scores and graduate grades ranged from .15 to .31 (Staff, 1988).

Whitworth and Barrientos (1990) studied 952 students admitted over a 5-year period to graduate studies at the University of Texas at El Paso; these included 320 Hispanics and 632 Anglos. Anglos scored significantly higher than Hispanics on undergraduate and graduate GPA as well as on the three scores that the GRE yields – verbal, quantitative, and analytic. An analysis (using multiple regression) to predict graduate grades indicated a correlation of .19 for Hispanic students and .27 for Anglos; these correlations were based on both undergraduate GPA and all three GRE scores. In fact, most of the correlation was due to undergraduate grades, and GRE scores did not predict graduate grades in either group. The authors listed several limitations of their study, including restriction of range in the variables studied and graduate grade inflation (because usually only grades of A or B are assigned), but nevertheless concluded that the GRE appeared to discriminate against Hispanics when compared to Anglos.

The MMPI with Minorities

A number of studies, especially in the 1960s and 1970s, raised questions about the use of the MMPI as a diagnostic tool with minorities, especially with blacks. Many of these studies found that blacks responded differently than Anglos to many MMPI items and that norms based upon primarily white groups were misleading (e.g., Baughman & Dahlstrom, 1972; Gynther, 1972; Gynther, Fowler, & Erdberg, 1971; D. Miller, Wertz, & Counts, 1961). Reviews of MMPI studies (e.g., Dahlstrom, Lachar, & Dahlstrom, 1986; Pritchard & Rosenblatt, 1980) reported that significant differences on MMPI profiles between blacks and whites were found; that usually these involved blacks scoring higher than whites on scales F (deviant response), 8 (schizophrenia), and 9 (hypomania); that these differences tended to be of 1 or 2 raw points in magnitude; and that when groups were equated for such variables as education, socioeconomic status, and age, the differences often disappeared. However, other researchers either found no significant racial differences (e.g., J. Flanagan & Lewis, 1969; Stanton, 1956) or concluded that the obtained differences were a function of differences in socioeconomic status and/or education (e.g., W. E. Davis, 1975; W. E. Davis, Beck, & Ryan, 1973). Fewer studies have been done on other minorities such as Hispanics, but the issues and the findings seem to be the same (e.g., McGill, 1980; Plemons, 1977; Reilly & Knight, 1970). Despite the substantial popularity of the MMPI and the importance of possible test bias, there is little evidence with which to interpret the racial differences that have been obtained. What data there are suggest that these differences are related to differences in personality and/or behavior rather than to test bias (J. R. Graham, 1987).

SUMMARY

The issue of testing in a cross-cultural context is a difficult one. Unfortunately, it cannot be divorced from the political, economic, and social consequences of decisions that need to be made. If there is one message, it is that those who consider tests to be totally useless are as misguided as those who consider tests to be totally valid. With appropriate cautions and careful application, based on the canons of acceptable reliability and validity, tests can be useful but limited tools.

SUGGESTED READINGS

Bracken, B. A., & Fouad, N. (1987). Spanish translation and validation of the Bracken Basic Concept Scale. School Psychology Review, 16, 94–102.

An interesting article that illustrates several steps that can be taken to assure that a test translated from a source to a target language remains equivalent across cultures.

Holtzman, W. H. (1968). Cross-cultural studies in psychology. International Journal of Psychology, 3, 83–91.

Although somewhat simplistic in approach, this is a very readable article that illustrates some of the challenges and stumbling blocks of doing cross-cultural research.

Huebner, E. S., & Dew, T. (1993). An evaluation of racial bias in a Life Satisfaction Scale. Psychology in the Schools, 30, 305–309.

A typical study that looks at the possible presence of racial bias in a scale that measures overall satisfaction with life. The findings of no bias when this scale is used with black adolescents are also quite typical.
Lam, T. C. M. (1993). Testability: A critical issue in testing language-minority students with standardized achievement tests. Measurement and Evaluation in Counseling and Development, 26, 179–191.

The concept of testability refers to the degree to which a test is appropriate and valid for a particular subgroup, in this case individuals who have limited English proficiency. An interesting and well-written article that discusses testing issues relevant to language-minority students.

Schwarz, P. A. (1963). Adapting tests to the cultural setting. Educational and Psychological Measurement, 23, 673–686.

Although this report is rather old, it still offers interesting insights into the challenges faced by an investigator who wishes to use tests in a different cultural context – in this case, Africa.

DISCUSSION QUESTIONS

1. Why do you think minority children do less well on tests of cognitive abilities than majority children?

2. If all tests such as intelligence tests, college entrance exams, etc., were eliminated, what might be some of the consequences?

3. Could you explain slope bias and intercept bias to someone who has not studied statistics?

4. What do you think of racial norming?

5. Imagine that you have decided to move to Mexico (or some other country that you might be familiar with). What would be involved in your becoming "acculturated"?
12 Disability and Rehabilitation
AIM This chapter looks at testing in the context of disability and rehabilitation.
Although we talk about the entire life span, the focus is on adults, as opposed to
Chapter 9 where the focus was explicitly on children. We look at three major cate-
gories of disability: visual impairment, hearing impairment, and physical impairment.
For each category, we look at some of the basic principles and some specific examples
as illustrations of these principles.

SOME GENERAL CONCERNS

Who are the disabled? The International Center for the Disabled undertook three national surveys between 1986 and 1989 to obtain some fundamental information. They reported that in the United States there are some 27 million individuals, aged 16 and older, who are disabled. Approximately 66% are not working, but two thirds of these say they would like a job. Among the major barriers to employment were lack of marketable skills, lack of accessible or affordable transportation, and feelings that employers do not recognize the fact that they are capable of doing full-time jobs. As a matter of fact, a subsequent survey of employers and managers indicated that disabled employees typically received good or excellent job ratings.

Categories of disability. The four major categories of disabilities – vision impairment, hearing impairment, physical or motor disabilities, and learning disabilities – account for the majority of occasions where the application of standard tests presents a challenge, both from a psychometric and a clinical perspective. We take a look at the first three of these categories. With regard to the fourth, the term learning disabilities refers to a variety of neurological problems that affect how a person organizes sensory information; as a result, test performance is affected. Others define learning disability more in the context of difficulties in the basic processes involved in the use of spoken or written language. The diagnosis, however, is often made on the basis of self-reported or observed behaviors (e.g., not doing well on classroom tests when all else points to average or above-average capabilities), rather than on neurological findings. Learning disabilities may manifest themselves in a wide variety of problems and may relate to listening, thinking, speaking, reading, writing, spelling, or doing mathematical calculations. Because this topic is complex in terms of etiology and other aspects and is, at present, the source of much controversy, we do not consider it here (see Feldman, 1990; Gaddes, 1994; Rourke & Fuerst, 1991; or Westman, 1990, for some current views).

Vocational evaluation. In rehabilitation settings, the focus is often on helping the client be a productive and contributing member of the community, i.e., helping him or her obtain employment commensurate with his or her abilities and needs. A basic assumption of our culture is that everyone is entitled to the most productive life of which they are capable. Such goals are often
achieved by vocational evaluation, which may involve psychological tests or other procedures such as work samples, where the client is evaluated using a simulated work activity or by actual on-the-job evaluation. A number of experts in this area feel that testing is quite limited. Clearly, a variety of approaches need to be used, and the criterion of validity is paramount. Tests can provide relatively objective, reliable, and valid indices and, compared with other approaches such as on-the-job evaluation, are less expensive and less time consuming. Tests are typically more comprehensive and can provide more varied assessment data. At the same time, there are a number of legitimate constraints and criticisms that can be made of tests, as we will see.

Vocational placement. Vocational placement is not simply a matter of placing the "proper peg in its round hole"; the matter is complicated, given the complexities of the labor market, the technological advances that have revolutionized many jobs, and the nature of the disabilities a client may have. In addition, there are a number of psychological aspects to consider. For example, adults who have disabilities have often been sheltered and overprotected by their families and may have had little exposure to the world of work. The specific use of vocational or career interest tests may be of differential utility in young clients with a history of disability vs. older clients who, because of some accident, can no longer pursue their former occupation (A. B. Meyer, Fouad, & Klein, 1987).

Challenge of testing the handicapped. Individuals with disabilities present a number of challenges with regard to tests. Because such disabilities can involve a wide range of conditions, such as mental retardation, visual impairments, chronic health problems, and so on, the challenges are numerous and varied. For example, typical paper-and-pencil inventories require a fair degree of hand and finger dexterity, so often clients cannot complete these without assistance. Often, disabilities can result in poor self-esteem, a high degree of frustration, depression, denial, poor adjustment, and other psychological aspects that may interfere with and/or color the results of a psychological test. Simeonsson, Huntington, and Parse (1980) stated that "it is an unfortunate irony that valid assessment is extremely difficult to achieve with children for whom it is essential" (p. 51). These authors indicate that with disabled children there are four major goals of testing: (1) to predict future status, (2) to prescribe an appropriate treatment, (3) to assess progress and/or evaluate programs, and (4) to develop a comprehensive data base across time and/or across children. It is interesting to note that although studies of normal children find relatively poor predictability from infancy to later periods, studies of disabled children find substantial stability – typically 70% to 90% of the children tested retain their diagnostic classification upon later testing (e.g., DuBose, 1976; Vandeveer & Schweid, 1974).

A. B. Meyer, Fouad, and Klein (1987) point out that in addition to the usual concerns about reliability and validity, there are three aspects of importance in the use of tests with rehabilitation clients:

1. The readability level of the instrument is especially important because such clients may have limited educational skills, often as a result of the very disability (a rough computational check of readability is sketched after this list).

2. The test should be appropriate for the client; this applies particularly to time limits and to performance aspects that may not be suitable for the particular client.

3. The test should have a validity scale that indicates whether the client is responding randomly or inconsistently (tests such as the MMPI and CPI have such scales as part of the inventory).
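
One way to act on the first point is to estimate an instrument's reading level before administering it. The Python sketch below computes the Flesch-Kincaid grade level, a widely used readability formula; the syllable counter is a crude vowel-group heuristic, so the result should be treated only as an approximation:

    import re

    def count_syllables(word):
        """Crude heuristic: count vowel groups (approximation only)."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def fk_grade_level(text):
        """Flesch-Kincaid grade level:
        0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return (0.39 * (len(words) / sentences)
                + 11.8 * (syllables / len(words)) - 15.59)

    items = "I feel blue most of the time. I like to read about history."
    print(round(fk_grade_level(items), 1))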

Wide range of impairment. Even within one category of disability, there may be a wide range of impairment. For example, individuals who are motorically impaired and use a wheelchair may have no upper-body limitations, mild limitations, or moderate to severe limitations. In a testing situation, the degree of limitation would dictate whether the individual can respond on a standard answer sheet, requires additional breaks, or uses special equipment. Thus, testing of rehabilitation clients requires a thorough knowledge and awareness of that client's particular abilities and disabilities and a sensitivity to individual differences.

Common sense guide. Often, testing of the disabled requires a great deal of common sense, which frequently seems to be a precious commodity. The U.S. Office of Personnel Management has put together a guide for administering employment tests to handicapped individuals (S. M. Heaton, Nelson, & Nester, 1980). If the client is blind, for example, the examiner should speak directly to the person, should not avoid phrases such as "it's nice to see you," should not pet or distract the guide dog, and should let the blind person take his or her arm. With a deaf client, it is important to determine the best means of communication. The examiner should speak clearly and distinctly, at a normal speed and tone, but with expression, because facial expressions, gestures, and body movements can help clarify the meaning of what is said. If the client is in a wheelchair, the examiner should not hold on to the wheelchair. If possible, the examiner should also be seated, because it is uncomfortable for a seated person to look straight up for a long period.

A multivariate approach. There is a strong suggestion in the literature that a multivariate approach is best in the testing of disabled individuals. One aspect of such an approach is to use instruments that yield subscale scores (such as the MMPI, the WAIS, etc.). Another is to use a battery of instruments, so that hypotheses generated by one test result can be corroborated in other test results. Finally, we should not lose sight of the fact that tests are but one source of information, and in assessment there is need for various sources of information (Simeonsson, Huntington, & Parse, 1980).

Diagnostic assessment. Hartlage (1987) describes four stages in the process of diagnostic assessment with rehabilitation clients. The first is to determine the relevant issue or question, that is, the goals of assessment for this particular client. Often, a battery of tests is administered routinely to all clients. Although such data may be quite useful for research purposes or may provide valuable local norms, the usefulness of such an approach for a particular client may be quite limited.

The second stage involves obtaining the necessary background data. This can involve medical records that can yield valuable information about a client's disabilities, or prior testing results that may dictate a different or more circumspect approach. For example, a client may already have been tested with a particular instrument, and retesting may be less valid or useful.

The third stage is the selection of appropriate instruments. Ordinarily, these include measures of intellectual functioning, assessment of hemispheric dominance, assessment of academic skills, and assessment of special characteristics such as visual-motor coordination, career interests, personality, and aptitudes.

In the area of intellectual functioning, a wide range of tests can be useful, from broad-based tests such as the Wechsler to more narrow and limited tests such as the D–48. The area of hemispheric dominance represents a newer concern. The research so far suggests that the two cerebral hemispheres control different abilities; language and sequential skills are controlled by the left hemisphere, and spatial and simultaneous thinking abilities are controlled by the right hemisphere. In a large number of adults, perhaps as many as 75%, there are anatomic asymmetries of the cerebral hemispheres that may be related to differential performance in a variety of areas (Geschwind & Levitsky, 1968). Evidence of hemispheric dominance can be obtained from a variety of sources, including tests such as the WAIS (verbal vs. performance scores) or neuropsychological test batteries such as the Halstead-Reitan (see Chapter 15). In rehabilitation clients, cerebral dominance may be particularly important, whether it occurred naturally or whether it was the result of injury, either at birth or subsequent to birth (as in a motorcycle accident). In the area of academic skills, such test information may already be part of the client's file, such as test scores from routine school testing. Sometimes such information, when compared with more current test results, can yield evidence of intellectual deterioration. A comparison of academic skills with intellectual functioning can also be quite informative from a diagnostic point of view.

The fourth stage involves placing the results into a specific rehabilitation plan. The information obtained from testing needs to be translated into decisions and actions for the benefit of the client.

MODIFIED TESTING

In testing clients with disabilities, the examiner often needs to modify the test and/or the testing procedure to be consonant with the client's capabilities and needs. Tests can be modified in a variety of ways, but most changes can be subsumed under one of three categories: changes in testing medium, changes in time limits, and changes in test content.

Testing medium. In most tests, test information is presented in the English language. Changes in testing medium therefore can refer to using braille, large print, audiotapes, or a reader or amanuensis (a person who writes or marks the answers for the test taker), and so on. Ordinarily these changes do not present any problems from a psychometric perspective, but there are some circumstances that need to be considered. For example, long reading passages may be more difficult when presented orally than in their printed version. Figural materials such as charts can also be a problem. A figure can be presented by embossing, but the tactual sense required to understand the embossed figure is quite different from the visual sense needed to understand the original. Thus the "same" item may be measuring quite different functions in a sighted person than in a visually impaired person.

Time limits. We saw the distinction between power and speed tests in Chapter 1. Changing time limits, especially with speed tests, creates problems psychometrically. Many tests that are nominally power tests are actually "speeded power" tests, in that a considerable number of people are unable to attempt every question. Having additional time, as might be allowed for a disabled subject, gives an "unfair" advantage, yet typically people with disabilities do need additional time. Nester (1993) suggests that the ideal solution is to eliminate the use of speeded power tests.

Test content. Changes in test content might include changes in individual test questions, changes in the question type being used, or changes in one of the variables being measured (for example, by dropping a subtest). From a construct validity point of view, changes in a specific test question are acceptable because presumably the underlying construct is still being measured. Changes in question type or in the variables being measured are much more problematic, and most experts would agree that evidence would be needed to show that such changes did not alter the reliability and validity of the test.

Item content. For some tests, the meaning of a particular item may in fact be a different stimulus for the disabled. The MMPI (see Chapter 7) provides a good illustration of how the presence of a disability or physical illness can affect the meaning of test scores. The MMPI contains a number of items that relate to physical complaints and physical health, items such as "I frequently suffer from constipation" and "My health is not as good as it used to be." With individuals who do not have major disabilities, endorsement of such items may be indicative of psychological difficulties, such as preoccupation with one's body or the presence of physical symptoms that reflect psychological conflicts and inadequate adjustment. With individuals who do have disabilities, such endorsement may well reflect the physical reality rather than any psychopathology.

Small samples. Much work has gone into modifying tests and testing procedures for use with disabled individuals. At the same time, there have been fewer efforts to determine the effects of such modifications on the reliability and validity of specific tests. In part, a major reason offered for the paucity of such studies is the availability of only small, atypical samples, while psychometric procedures to assess reliability and validity typically require large and representative samples.

A Panel on Testing of Handicapped People, formed because of the Rehabilitation Act of 1973, examined various issues concerning the use of standardized tests in making decisions about people with disabilities and made a number of specific suggestions (Sherman & Robinson, 1982). With regard to the validity of tests, the Panel urged that validity information on disabled samples was urgently needed. In particular, information was needed on the effects of modifying tests and test administration procedures on reliability, validity, and other aspects of test performance. Furthermore, the Panel suggested that techniques to validate tests with small samples
A Panel on Testing of Handicapped People, formed because of the Rehabilitation Act of 1973, examined various issues concerning the use of standardized tests in making decisions about people with disabilities and made a number of specific suggestions (Sherman & Robinson, 1982). With regard to the validity of tests, the Panel urged that validity information on disabled samples was urgently needed. In particular, information was needed on the effects of modifying tests and test administration procedures on reliability, validity, and other aspects of test performance. Furthermore, the Panel suggested that techniques to validate tests with small samples needed to be developed and used. Other possibilities, such as pooling of data from several samples, needed to be explored. With regard to the testing situation, the Panel pointed out that little was known about the relationship of aspects of the disabled person and the testing situation, as reflected in such questions as: Why do some disabled persons take the standard form of a test, such as the SAT, rather than the modified version? Are there other approaches besides paper-and-pencil tests? How does the attitude of the examiner affect the test scores of the disabled person? How do coaching and training in test-taking strategies affect the performance of disabled persons? Finally, with regard to the actual decision-making process, how are test scores actually used in making decisions about disabled individuals, and how do “flagged” (i.e., the results of a nonstandard test administration identified as such) test scores influence the decision-making process? The Panel also concluded that tests can provide good objective measures of merit and provide the opportunity to show that a person has the required abilities. At the same time, members of the Panel felt that psychometric technology simply was not advanced enough to ensure that tests could measure skills independent of disabling conditions, and they recommended that multiple sources of information be used to supplement test scores.

Note that when we ask how changes in a test affect that test’s reliability and validity, the implication is that the potential result is to lower the reliability and validity. Quite legitimately, we can ask what happens if we administer a test with no modification to a disabled individual. In many cases, the test would be invalid.

ETS studies. The major source of information about the reliability and validity of modified testing was compiled by researchers at Educational Testing Service (ETS). From 1982 to 1986, ETS conducted a series of studies on the performance of individuals with disabilities on the SAT and the GRE. Four types of disabilities were considered: visual impairment, hearing impairment, physical impairment, and learning disabilities. The results of these studies were described in a comprehensive report (Willingham, Ragosta, Bennett, et al., 1988). These researchers assessed the comparability of standard and nonstandard (i.e., modified) test administrations using a variety of indices including test reliability, factor structure, how test items functioned, admissions decisions, and so on. Although there were a number of specific findings, the general conclusion was that the nonstandard versions of the SAT and the GRE were generally comparable to the standardized versions in most important aspects. What these studies do show is that carefully thought-out test accommodations can be successful in preserving the reliability and validity of tests (Nester, 1993).
SOME GENERAL RESULTS

Basic issues. In testing disabled clients, there are a number of basic issues to keep in mind. Tests that are administered individually, such as the WAIS, allow the examiner to observe the client and gather clinically relevant information. Group-administered tests, although more economical and usually easier to administer, do not usually provide such information. Sometimes, tests that were originally developed for group administration are given on an individual basis, especially by computer (see Chapter 17).

Speed of performance is sometimes an integral part of a test. For example, in the assessment of intelligence, some tests use items that have a time limit. Handicapped clients often do less well on timed tests, and their disability may interfere with rapidity of response. The crucial question here is whether the person’s ability is underestimated by a speeded test.

Finally, we must note that the frequency of some disabilities seems to be greater among minority and/or disadvantaged groups. So here we have the questions of potential test bias and of appropriate norms.

Test sources. In general, not much is available to the tester who needs to assess disabled clients. Many textbooks do not mention such a topic, and appropriate journal articles are scarce and scattered throughout the literature. One useful source is the book by Bolton (1990), which covers a collection of 95 instruments for use with disabled clients.
Career assessment. Individuals who are disabled often have great difficulties in obtaining employment. Although some of these difficulties are due to prejudice, stereotypes, or realistic demands of a specific occupation, some of the difficulties are due to the nature of the disability and its concomitant conditions (e.g., lower self-esteem). Vocational guidance then becomes very important, and the use of career-interest tests potentially useful. The results of career-interest tests can: (1) provide a focus for vocational evaluation and planning; (2) yield information that the client can use to achieve career goals; (3) identify occupations and interest areas that the client might not have considered; (4) provide both the counselor and the client with useful information about the client’s orientation toward work and level of vocational maturity; and (5) provide valuable information as to the selection of a college major or appropriate educational pathway, such as technical institute, community college, or apprenticeship (L. K. Elksnin & N. Elksnin, 1993).

In Chapter 6, we discussed the two major career-interest inventories – the Strong Interest Inventory and the Kuder inventories. These can be very useful with some disabled clients, but also have limited utility with other disabled clients; they require not only the ability to read, but to read at a fairly sophisticated level, with instructions typically requiring high-school or college-level reading comprehension.

A number of career-interest inventories have been developed that eliminate reading by using drawings that typically show individuals engaged in different occupations. Typical of these picture-interest inventories is the Geist Picture Interest Inventory-Revised (GPII-R; Geist, 1988). The GPII-R is designed to assess 11 masculine and 12 feminine general-interest areas of individuals who have limited reading/verbal skills. These areas include mechanical, music, outdoors, social service, and personal service (such as beautician). The GPII-R is composed of 44 picture triads in which each drawing depicts a person engaged in work activities that reflect a particular interest area. For each triad, the client is asked to indicate which activity he or she is most interested in doing. The GPII-R is suitable for normal children in grades 8 to 12 and for adults who have limited verbal skills. It is untimed, requires hand scoring, and can be administered to individuals or to groups. The inventory requires that the client circle the drawing of choice, but presumably this can be done by the examiner if necessary, without altering the test.

Test-retest reliability over a 6-month period seems adequate to marginal, with most coefficients falling below .80; predictive validity information is not given.

The GPII-R has been criticized on the grounds that its black-and-white line drawings appear dated, that some are ambiguous, and that many occupations covered reflect gender stereotypes. Some other picture interest inventories are the Wide-Range Interest-Opinion Test (J. F. Jastak & S. Jastak, 1979), the Pictorial Inventory of Careers (Kosuth, 1984–1985), and the Reading-Free Vocational Interest Inventory-Revised (R. L. Becker, 1988). Many of these picture inventories have inadequate standardization and/or normative samples; some samples are quite small, others are atypical, and still others are not described adequately in terms of demographic variables. Sometimes, current norms are not available. Thus, the examiner is often left without the necessary information needed to make an informed decision about the meaning of the client’s scores. In general, many if not most of these instruments are technically inadequate, and their results should be confirmed through other sources of information (L. K. Elksnin & N. Elksnin, 1993).

A more basic issue has to do with the developmental aspects of career interests. In normal subjects, career interests begin to stabilize around ages 17 to 18, and testing of younger subjects to elicit their career interests yields limited and unstable data. Clients who have developmental disabilities, for example, those born blind, may also experience delays in the development and stability of their career interests, or may be deprived of the kind of experiences, such as after-school work, that crystallize into career interests. Such delays and deprivations may make career-interest test results invalid. In addition, the most technically competent career-interest tests are geared for professions or occupations that require a fair amount of education. Often, clients in rehabilitation settings may be more realistically oriented towards lower-level occupations, and tests available for these are not as psychometrically sophisticated.
Admissions testing. Most colleges, graduate-school programs, medical and law schools, and other professional programs require some test such as the SAT (Scholastic Aptitude Test), GRE (Graduate Record Exam), or MCAT (Medical College Admission Test) as part of the admissions procedure. A number of disabled applicants take these tests each year, and most of these tests have special provisions for such applicants. How well do such test scores predict the performance of disabled students? There is relatively little data available, but Sherman and Robinson (1982) give some representative examples. In one study of disabled students (type of disability not specified), both their high-school and college grades were somewhat lower than for nondisabled students (e.g., 0.17 grade point lower for the first year of college). The correlation between the college entrance exam (ACT in this case) and college freshman grades was .46, while the correlation between high-school grades and college-freshman grades was .43. For nondisabled students, the correlations were quite similar, except that for nondisabled students high-school grades were a slightly better predictor of college grades than were test scores. Some studies have indicated that tests such as the SAT are as good a predictor of college grades for both disabled and nondisabled students, while other studies suggest that measures specifically designed for a disabled population are better predictors.
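Validity coefficients of this kind are Pearson correlations between a predictor (a test score or high-school grades) and a criterion (freshman grades). A minimal sketch, with invented scores rather than the data of the studies cited:

    # Pearson correlation between entrance-exam scores and freshman GPA
    # (invented data for eight students).
    def pearson_r(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    act = [18, 22, 25, 28, 30, 21, 24, 27]
    gpa = [2.1, 3.4, 2.4, 3.3, 2.6, 2.9, 2.5, 3.1]
    print(round(pearson_r(act, gpa), 2))  # about .32 for these invented numbers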
occupational aptitude tests. He classified the apti-
tude tests used in personnel selection and place-
World of work. Licensing or certification pro- ment into five major types: (1) tests of intellectual
cedures that include a test are required for many abilities, measuring such aspects as memory and
occupations. In addition, civil-service systems arithmetic computation; (2) tests of spatial and
rely heavily on screening through testing. Thus, mechanical abilities; such tests are of limited use
disabled individuals may encounter such tests, with visually handicapped clients beacause they
especially in rehabilitation agencies that assist require a fair degree of vision; (3) tests of percep-
them with training and obtaining employment. tual accuracy; these usually involve sets of items
Employment testing has been under close where the subject needs to decide if they are the
scrutiny since the mid 1960s when it was observed same or different, or items where the client is
that the use of such tests tended to screen out required to cancel out something, for example
minority-group members. Such close scrutiny the letter “e” in a paragraph of prose. Some of
had three major consequences: (1) The impor- these procedures can be administered through
tance of tests in the total hiring process was over- braille; (4) tests of motor abilities, for exam-
estimated. It is rare that decisions to hire are ple, tests that require manual dexterity where
primarily driven by test results, and test results blocks and cylinders are moved as rapidly as pos-
are typically seen as just one of many sources of sible into different configurations; (5) personality
information. (2) Much effort has been invested tests, including both paper-and-pencil invento-
in employment testing, including litigation of ries like the MMPI and CPI, and projective tests
results. (3) Use of tests in the world of work has such as the Rorschach and Thematic Appercep-
declined, in large part because of confusion about tion Test. Some of these are useful even with
Some of these are useful even with visually handicapped clients; for example, E. L. Wilson (1980) reports on the use of the TAT with totally blind clients, where the stimulus picture is described to the client.

The findings of Ghiselli’s review are rather complex, but we can summarize some of the major conclusions. Training, as reflected by criteria such as grades in occupational-training courses, was predicted well by tests of intellectual abilities, spatial and mechanical abilities, and perceptual accuracy. Proficiency, as reflected by criteria such as supervisors’ ratings or measures of dollar volume of sales, was less successfully predicted, with most correlation coefficients in the .20s, except for sales and service occupations, where personality tests seemed substantially more valid predictors. Overall, Ghiselli concluded that tests possessed sufficient predictive power to be of considerable value in personnel selection. Recent studies using meta-analysis do suggest that tests of cognitive abilities correlate from the low .30s to the mid .70s for predicting training success, and from the .30s to the low .60s for predicting job proficiency (J. E. Hunter & R. F. Hunter, 1984). These results come from studies of “normal” subjects, but presumably would generalize to studies of rehabilitation clients.
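In its simplest “bare-bones” form, a meta-analysis of validity studies averages the observed correlations, weighting each by its sample size. The Hunter and Hunter analyses also correct for artifacts such as range restriction and criterion unreliability; the sketch below, with invented study results, omits those corrections.

    # Sample-size-weighted mean validity coefficient across studies
    # (invented (r, N) pairs, uncorrected for statistical artifacts).
    studies = [(0.35, 120), (0.51, 300), (0.28, 80), (0.44, 210)]
    mean_r = sum(r * n for r, n in studies) / sum(n for _, n in studies)
    print(round(mean_r, 2))  # 0.44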
LEGAL ISSUES

Full participation. In the United States, many of the important decisions made about an individual, such as placement in particular educational programs, admission to college and professional schools, military assignment, and so on, are made in part upon the basis of test results. One can argue that such use of tests is suspect and negative, and that tests ought to be banned. It can also be argued that denying a person the opportunity to take a test and to show in an objective and standardized manner what the person can do is an infringement of that person’s civil rights (Sherman & Robinson, 1982).

Full participation in American society has become a major goal of the disabled, one that has been supported through a variety of federal legislative acts. Thus, children with special-educational needs are no longer segregated, but where possible are “mainstreamed” in the regular classroom, and such decisions are made in part on the basis of test results.

From a legal standpoint, it is assumed that disabled people can be tested in a way that will not reflect the effects of their disability, and that the obtained scores will be comparable with those of people who are not physically challenged. Whether in fact such assumptions can be met psychometrically is debatable.

Minimum competency testing. There has been widespread concern in the United States that graduating high-school seniors have not received an adequate education. One response on the part of many states is to mandate some form of minimum competency testing. That is, before a student is promoted or awarded a diploma, he or she must show minimum competencies as assessed by a battery of achievement tests. Such programs have become major political issues in some communities, especially because such tests often show differential racial performance. Another issue has to do with disabled students. Should they be exempted from such tests? Should they be allowed extended time, the use of readers, or other nonstandard procedures? Should different tests be used that are tailored to the needs of the student? Such issues are not easy to resolve because they transcend psychometric issues (Haney & Madaus, 1978; Sherman & Robinson, 1982).

Which norms? The issue of nondiscrimination against persons with disabilities is relatively new in the area of psychometrics, and very little research has been devoted to it. In the past, most assessment efforts were devoted to placement in settings where the disabled individual would not compete with “normal” persons, and therefore the question of norms was not an important one. Typically, available test norms are for “normal” groups, perhaps a captive group working in a specific industry, such as 62 assembly-line workers at a General Motors plant in Michigan. Such norms may well represent a group that has already been screened, where incompetent workers were not hired and the group has survived a probationary period; it most likely contains few, if any, individuals with disabilities. Sometimes published norms are available for particular subgroups of disabled clients, perhaps tested at a rehabilitation agency or at a training school.
Sometimes the local agency where a client is tested develops its own in-house norms, and sometimes it does follow up clients to determine the relationship between test scores and subsequent on-the-job competence. In general, however, much remains to be done to provide practitioners with better, well-defined norms.

When we compare a client’s raw score on a test with norms, we are in effect describing that client’s level of performance relative to some comparison group. Thus, Rebecca takes the WAIS-R and her raw score is translated, through norms, into an IQ of 126. Ordinarily, however, we are not content with simply describing, but also wish to forecast future behavior. We might want to predict, for example, how well Rebecca will do in college. But such prediction needs to be made not just on the basis of descriptive data, i.e., norms, but on the basis of predictive evidence, such as expectancy tables or regression equations. For rehabilitation clients, such data is not readily available.
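The distinction between describing and forecasting can be made concrete. Predictive evidence typically takes the form of a least-squares regression equation built on a validation sample (an expectancy table presents the same information in grouped form). A minimal sketch, using an invented validation sample rather than actual WAIS-R data:

    # Least-squares regression line for predicting freshman GPA from a test score
    # (invented validation sample; slope = Sxy/Sxx, intercept = my - slope*mx).
    scores = [95, 100, 108, 112, 120, 126, 130]
    gpas = [2.0, 2.3, 2.5, 2.9, 3.1, 3.2, 3.6]
    n = len(scores)
    mx, my = sum(scores) / n, sum(gpas) / n
    sxx = sum((x - mx) ** 2 for x in scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(scores, gpas))
    slope = sxy / sxx
    intercept = my - slope * mx
    print(round(intercept + slope * 126, 2))  # predicted GPA for a score of 126: 3.35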
Governmental efforts. Two governmental agencies have been involved in the testing of the disabled and in efforts to solve various legal/psychometric problems. One agency is the U.S. Civil Service Commission (now known as the U.S. Office of Personnel Management, or OPM), and the other is the U.S. Department of Labor. The OPM develops tests used to select people for jobs in the federal government. Beginning in the 1950s, they developed tests to select blind applicants for industrial jobs, studied the validity of these tests simultaneously for blind and for sighted persons, and have continued various studies up to the present.

The U.S. Department of Labor has, among other activities, investigated the scoring patterns of deaf, emotionally disturbed, and mentally retarded persons on the General Aptitude Test Battery (GATB), an instrument used by state employment services for a wide range of jobs (Nester, 1993); in fact, the GATB is the most widely used test for state and local government hiring and private-sector job referral. The GATB consists of 12 tests that measure 9 aptitudes:

1. General learning ability (or intelligence), defined as the ability to understand instructions, to reason, and to make judgments.
2. Verbal aptitude, the ability to understand words, to use them effectively, to understand relationships between words, to present ideas clearly.
3. Numerical aptitude, the ability to perform arithmetic operations accurately and with rapidity.
4. Spatial aptitude, the ability to visualize objects in three-dimensional space.
5. Form perception, the ability to perceive details and differences in different objects.
6. Clerical perception, the ability to proofread correctly, to perceive pertinent details.
7. Motor coordination, the ability to coordinate visual and motor movements, to make movements accurately and rapidly.
8. Finger dexterity, the ability to manipulate small objects rapidly and accurately.
9. Manual dexterity, the ability to work with one’s hands easily and skillfully.

The first seven aptitudes use paper-and-pencil measures; the last two aptitudes are assessed through performance tests. The original edition of the GATB was published in 1947 (Dvorak, 1947) and was based on a factor-analytic approach – namely, there is a basic set of aptitudes, and different occupations require different combinations of such aptitudes. A battery of some 50 tests was thus factor analyzed; the resulting 9 aptitudes could then be measured by just using 12 of the 50 subtests. The chosen tests were selected on the basis of their factorial validity and their empirical validity against some external criterion (Droege, 1987). The 12 tests were then administered to a sample of workers in a wide variety of occupations. The data from this sample was used to develop tables to convert raw scores into standardized scores with a mean of 100 and SD of 20 (Mapou, 1955). Incidentally, the Labor Department routinely uses race norming on this test, that is, the percentile scores of whites, blacks, and Hispanics are computed within each racial group.

Administration of the entire battery requires about 2 hours and 15 minutes. One of the unique aspects of the GATB is that there are occupational-aptitude patterns that indicate the aptitude requirements for various occupations, so that a specific client’s scores can be matched against specific patterns.
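The two conversions just described are simple computations: a linear transformation of raw scores to the GATB metric (mean 100, SD 20), and a within-group percentile rank for the race-norming procedure. The sketch below uses an invented norm sample; it is not the actual GATB conversion table.

    # GATB-style linear conversion (mean 100, SD 20) and a within-group
    # percentile rank, both computed on an invented norm sample.
    def standard_score(raw, norm_group):
        m = sum(norm_group) / len(norm_group)
        sd = (sum((x - m) ** 2 for x in norm_group) / len(norm_group)) ** 0.5
        return 100 + 20 * (raw - m) / sd

    def percentile_within(raw, norm_group):
        return 100.0 * sum(1 for x in norm_group if x <= raw) / len(norm_group)

    norm_sample = [34, 41, 45, 47, 50, 52, 55, 58, 61, 67]
    print(round(standard_score(52, norm_sample)))  # 102 on the GATB metric
    print(percentile_within(52, norm_sample))      # 60.0th percentile within group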
There is a tremendous amount of technical material on the GATB, ranging from a very hefty test manual to government reports and studies in the scientific literature. For a description of other aspects of this program, see Droege (1987). Like other tests, the GATB is not free of criticism, and despite its wide use, it seems to have some significant weaknesses (see Keesling, 1985). We will return to this test in Chapter 14.

The “Standards.” The Standards for Educational and Psychological Testing issued in 1985 (see Chapter 1) contain a chapter on testing people with disabilities. The chapter stresses that caution must be used in interpreting the results of modified tests because of lack of data; it encourages the development of tests for people who are disabled. These standards include the requirements that the examiner have psychometric expertise and knowledge about the interaction of handicap and test modification, that such modifications be pilot-tested, that empirical procedures be used to establish time limits, and that the effects of modifications on reliability and validity be investigated. In addition, they indicate that either general population norms or specific norms (i.e., for the disabled) are to be used, depending on the purposes of testing. Other experts have, however, expressed the concern that the legal and regulatory requirements of the Americans with Disabilities Act of 1990 go beyond the state of the science of testing.

One of the basic principles of measurement, as we saw in Chapter 2, is that of standardization. The use of modified testing procedures departs from this principle and may affect measurement in an unknown way. Another issue of concern is whether test scores from nonstandard administrations of standardized tests, such as the SAT, should be “flagged” to indicate the nonstandard administration. Federal regulations prohibit inquiry about a disability before employment or during admission procedures, but “flagging” in fact alerts the selection officials that the candidate has a disability.

Civil Rights Act of 1964 (PL 88–352). This well-known legislative act made it illegal to discriminate among people on the basis of “race, color, religion, sex, or national origin.” The Act addressed five major areas in which blacks had suffered unequal treatment: (1) political participation; (2) access to public accommodations, such as restaurants; (3) access to publicly owned facilities, such as parks; (4) education; and (5) employment. Much of the language of the Act was prohibitory in nature (thou shalt not . . .), but there were a number of provisions (called titles) for implementing various sections of the Act. Title VII, for example, enumerated unlawful employment practices and established the Equal Employment Opportunity Commission as a means of implementing its provisions. Most legal challenges to the use of standardized tests have fallen under this Title. In effect, Title VII indicates that if a test results in a proportionally lower selection rate for minorities and females than for white males, the procedure is considered discriminatory, unless the test can be validated in accord with specific guidelines called the technical validation requirements.
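A “proportionally lower selection rate” is itself a simple computation. One common screening heuristic, the “four-fifths rule” of the federal enforcement guidelines (not stated in the Act itself), flags a procedure when a group’s selection rate falls below 80% of the highest group’s rate. A sketch with hypothetical applicant counts:

    # Selection rates and the four-fifths screening heuristic (hypothetical counts).
    groups = {"group A": (200, 80), "group B": (150, 36)}  # (applicants, hired)
    rates = {g: hired / applied for g, (applied, hired) in groups.items()}
    highest = max(rates.values())
    for g, rate in rates.items():
        status = "possible adverse impact" if rate < 0.8 * highest else "ok"
        print(g, round(rate, 2), status)  # group B: 0.24 < 0.8 * 0.40, so flagged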
Rehabilitation Act of 1973. This act extended civil rights protection to people with disabilities and was a direct descendant of the Civil Rights Act. One section of the 1973 Act, Section 504, specifically indicates that a disabled individual shall not be subjected to discrimination. Unfortunately, this act was enacted without specific indications of how the law was to be implemented. Eventually, guidelines for ending discrimination in areas such as education and employment practices on the basis of disability were developed. For example, employers cannot use tests that tend to screen out disabled individuals unless: (1) it can be shown that the test is job-related, and (2) alternative tests that do not screen out disabled individuals are not available (see Sherman & Robinson, 1982, for a detailed discussion).

With regard to admission to postsecondary educational institutions (such as college), Section 504 guidelines eventually evolved to cover four psychometric aspects:

1. A test that has disproportionate adverse effects on disabled applicants may not be used, unless the test has been validated specifically for the purpose in question, or unless alternative tests with less adverse effects do not exist.
2. Tests are selected and administered so that the results reflect the disabled applicant’s capabilities on the variable being measured, rather than the applicant’s impaired skills (unless the test is measuring those skills).
3. An institution may not inquire before admission whether a person is disabled; that is, test scores for a disabled person should be reported in an identical manner to those of nondisabled persons.
4. Prediction equations may be based on first-year college grades, but periodic validity studies against the criterion of overall success in the education program in question will be carried out.

These four provisions have generated a fair amount of psychometric research, although the overall picture is far from complete or comprehensive.

These governmental regulations, then, require reasonable accommodation to the needs of disabled persons, including appropriate adjustment or modifications to tests. Furthermore, as stated above, these regulations prohibit the use of a selection criterion that tends to screen out disabled persons unless that criterion has been shown to be job-related and there is no alternative procedure. Finally, these regulations state that tests should be selected and administered so that sensory and speaking disabilities do not interfere with effective measurement (Nester, 1984; 1993).

These issues become particularly crucial in a competitive employment situation where standardized assessment instruments are used to compare applicants and to select the most qualified. People who have physical disabilities will be at a disadvantage if their disability interferes with the accurate measurement of their abilities. The bottom line is that tests must not discriminate and must reflect the intended construct rather than the disability.

Public Law 94–142. As we saw in Chapter 9, this law, the Education for All Handicapped Children Act, mandates that a full and individual evaluation of a child’s educational needs be carried out before the child is assigned to a special-education program. However, the specifics of the evaluation are not spelled out, so that the number and nature of tests used can differ quite widely from one setting to another. The Act also mandates that tests and procedures used for the evaluation and placement of disabled children be used in a nondiscriminatory manner. These tests must be administered in the client’s “native language,” which for hearing-impaired children is their “normal” mode of communication, i.e., sign language, oral communication, mime, finger spelling, or other method. The tests must also be selected and administered so that children with impaired skills are not penalized for their lack of language ability. This means, in effect, that the examiner must be not only an expert test administrator, but must also be expert in the method of communication.

THE VISUALLY IMPAIRED

Types of tests. Tests that are used with the visually impaired seem to fall into one of four categories: (1) tests that were developed for the general population of normally sighted individuals and are simply used with the visually impaired; (2) tests developed for the general population, but with specific norms for the visually impaired; (3) tests adapted for the visually impaired by the use of braille, large print, or other procedure; (4) tests developed specifically for the visually impaired.

Visual nature of testing. Many problems complicate the assessment of the visually impaired person. A major problem is that most tests involve sight, because of the visual nature of the test items themselves, such as on the WAIS-R, or because of the way the measure is administered under standard conditions, as with the printed booklet of the MMPI. (There are braille versions of the MMPI, e.g., O. H. Cross, 1947.)

Visual impairment. There are many degrees of visual impairment; some individuals who are “legally” blind may have sufficient vision to read large print, with or without corrective lenses. Blindness can also be present from birth (the congenitally blind) or can occur any time after birth (the adventitiously blind).

Visual impairment in the elderly. For the individual with normal vision, there are two major causes of visual impairment as the person gets older. One of these is cataracts, a clouding of the lens of the eye and a condition that is now amenable to corrective surgery.
The second is macular degeneration. The macula is the central region of the retina responsible for fine visual acuity; degeneration of the macula affects vision, and in particular, the ability to read.

Modified testing. Quite often, the test changes that are needed to assess a visually handicapped person involve changes in the administration method. For example, test items may need to be presented in large print, in braille, or on an audiotape. Or a sighted reader may be required, one who reads and enunciates well. These procedures require more time than the standard method, and so usually speed tests are not appropriate for visually impaired persons. When there are time limits on power tests, these limits are usually for administrative convenience, so unlimited time could, at least theoretically, be given to all candidates, impaired or not.

There is also concern about the test content. In general, verbal tests such as the verbal portions of the Wechsler tests are appropriate for administration. Nonverbal items present a challenge. In fact, verbal cognitive tests seem to be valid predictors of work or academic success for the visually impaired. What limited evidence there is seems to support the criterion validity of such tests (e.g., Suinn, Dauterman, & Shapiro, 1967).

Some recommendations. In testing blind clients, it is recommended that the blind client be given whatever information a sighted client would have by nature of their sight (Cautilli & Bauman, 1987). Thus, the examiner should describe the test materials and, where appropriate, encourage the client to explore the materials by touch. If the test items are to be presented in braille, then the braille reading ability of the client needs to be determined. Having poor or no vision does not automatically guarantee proficiency in braille reading.

If the items are presented orally, dichotomous items, such as true-false, can be read orally by the examiner or a reader and the answers recorded for the client. Multiple-choice items represent more of a challenge because they require that the client keep in mind the various options. At the same time, if the answers are recorded by the examiner, there is less privacy for the blind client than there would be for a sighted client.

Need for special norms. Standard test norms, usually developed on sighted samples, can be used only with great caution, because administration of a test to a blind client is often substantially different from that of a sighted client. Norms based on the visually impaired are sometimes available, but often they are neither random nor representative. Quite often, blind clients who participate in testing for normative purposes may be captive clients of an agency that serves a particular segment of the population, for example, ambulatory blind clients being retrained for a new job. In addition, the term “blind” encompasses a rather wide range of amount of useful vision, so the normative sample needs to be carefully designated in terms of such vision. Finally, if the goal of testing is to provide information about the client in terms of competitive employment, then the more useful information is that based on the sighted individuals with whom the client will compete.

The need for special norms for the visually impaired is well illustrated by the Minnesota Rate of Manipulation Test (MRMT; American Guidance Service, 1969). This test consists of a formboard on which there are 40 round blocks, each nestled in a hole. The task is to turn the blocks over as rapidly as possible, as well as to move them in various prescribed ways. The test is designed to assess various aspects of motor coordination, and scores on it are related to success in various industrial-type jobs. The test was developed and normed on normal-sighted individuals, but the manual states that scores of blind persons correspond closely with the normative data. However, Needham and Eldridge (1990) showed that blind persons do score significantly below the norms of sighted persons. They suggested that its use as a predictor of motor skills in the blind may be limited, but that the test seems to be of value as a neuropsychological tool to identify brain-damaged, visually impaired persons.
proficiency in braille reading.
If the items are presented orally, dichotomous
items, such as true-false, can be read orally by the Cognitive abilities. The cognitive abilities of the
examiner or a reader and the answers recorded for visually impaired are often measured by verbal
the client. Multiple-choice items represent more tests, such as the verbal portions of the WAIS-R.
of a challenge because they require that the client In fact, the verbal portions of the Wechsler scales
keep in mind the various options. At the same are particularly useful. They can be administered
time, if the answers are recorded by the examiner, to the blind client exactly as with a sighted client,
there is less privacy for the blind client than there and therefore the regular norms can be used. A
would be for a sighted client. number of authors have argued that nonverbal
A number of authors have argued that nonverbal abilities should also be assessed because the results yield complementary information with respect to general intelligence, are less influenced by extraneous factors such as poor socioeconomic background, may provide opportunities to observe qualitative aspects of behavior, and may possibly be of diagnostic significance if there is a discrepancy between verbal and nonverbal abilities (Dekker, Drenth, Zaal, et al., 1990).

Studies of visually impaired students who take special administrations of college admission tests, such as the SAT and the ACT, suggest that their mean scores do not differ significantly from those of their normal-vision peers, to the degree that the test is untimed. Studies of such tests as the WISC and the WAIS also suggest that mean scores of the visually impaired are equivalent to those of the sighted population. On the other hand, performance on achievement tests is significantly below that of normal-sighted students (Willingham et al., 1988). Very few studies exist on the validity of cognitive tests for visually impaired students.

A representative study is that of Vander Kolk (1982). He studied 597 legally blind clients of two rehabilitation agencies who had been administered the WAIS verbal subtests. Vander Kolk found that the verbal IQ scores for various subgroups within this sample averaged from 104 to 107, all in the normal range, and indeed higher than the normative 100.

Comparisons of subtest mean scores were made between clients who were adventitiously blind vs. those who were congenitally blind, as well as between subjects with no light perception vs. those with some partial vision. The results are presented in Table 12.1.

Table 12–1. WAIS Verbal Subtests Findings on Visually Impaired (Vander Kolk, 1982)

    WAIS Verbal      Adventitious    Congenitally Blind    No vision       Partial vision
    Subtest          Mean    SD      Mean    SD            Mean    SD      Mean    SD
    Information      10.36   2.25    10.02   2.10          10.59   4.31     9.87   4.35
    Comprehension    10.91   2.23    10.05   2.54           9.81   4.18    10.29   2.68
    Arithmetic       11.47   3.12     9.63   2.28           8.72   3.79    10.35   2.71
    Similarities     11.99   3.20    10.90   2.24          11.04   4.17    11.30   2.75
    Digit Span        9.68   3.33    10.59   2.30          11.69   3.44    10.28   2.40
    Vocabulary       10.85   2.56     9.86   2.14          10.45   3.70     9.85   2.28

Recall that on the Wechsler tests, the subtests are normed so that the mean is 10 and the SD is 3. Adventitiously blind persons obtained average scale scores somewhat above sighted norms on the Arithmetic and Similarities subtests, while the congenitally blind group scored essentially in the normal range. Individuals who had no useful vision scored higher than the norm on Similarities and Digit Span, and lower on Arithmetic. Those with partial vision scored essentially at the norm, but higher on Similarities. Overall, the subgroup with no light perception seems more variable (note the larger SDs).
take special administrations of college admis- The results of this study indicate that the visu-
sion tests, such as the SAT and the ACT, suggest ally impaired do not differ on intelligence subtest
that their mean scores do not differ significantly scores from their sighted counterparts. These
from those of their normal-vision peers, to the results must be taken with some caution for sev-
degree that the test is untimed. Studies of such eral reasons. First of all, we don’t know whether
tests as the WISC and the WAIS, also suggest that these clients are representative of the population
mean scores of the visually impaired are equiv- of visually impaired. Those who seek out the ser-
alent to those of the sighted population. On the vices of a rehabilitation agency may be more intel-
other hand, performance on achievement tests ligent than those who do not seek out community
is significantly below that of normal-sighted stu- resources to aid them with their disability. Sec-
dents (Willingham et al., 1988). Very few studies ond, the comparisons presented in Table 12.1 do
exist on the validity of cognitive tests for visually not account for any potential differences in edu-
impaired students. cation, age, or other variables between subgroups
A representative study is that of Vander Kolk that might covary with intelligence; that is, there
(1982). He studied 597 legally blind clients of is no indication in the study that the samples were
two rehabilitation agencies who had been admin- equated.
istered the WAIS verbal subtests. Vander Kolk
found that the verbal IQ scores for various sub- The Perkins-Binet. The intellectual functioning
groups within this sample averaged from 104 to of visually impaired children is typically assessed
107, all in the normal range, and indeed higher with modified versions of standardized tests such
than the normative 100. as the Stanford-Binet or the Wechsler tests. How-
Comparisons of subtest mean scores were ever, a modification of a test may be such as to
made between clients who were adventitiously actually result in a new test, rather than just a
blind vs. those who were congenitally blind, as slightly altered form. An example of this is the
well as between subjects with no light perceptions Perkins-Binet Test of Intelligence for the Blind
vs. those with some partial vision. The results are (C. Davis, 1980). Perkins is a school for the
presented in Table 12.1. blind in Watertown, Massachusetts. In the
Recall that on the Wechsler tests, the subtests Perkins-Binet a number of items have been
are normed so that the mean is 10 and the SD is altered, instructions have been changed, and time
In the Perkins-Binet a number of items have been altered, instructions have been changed, and time limits omitted. There are two forms available: form N for children aged 4 through 18 with nonusable vision, and form U for children aged 3 through 18 with some usable vision.

The test was standardized on low-vision and blind school children, and measures performance as well as verbal abilities. This test represents a valiant but somewhat limited effort, and many criticisms have been voiced. First of all, some degree of judgment is required to select the appropriate form. Secondly, the instructions may be altered to fit the particular testing situation; although this is a realistic approach, the fact that the examiner is permitted to alter such instructions may introduce an additional source of error variance. There also seem to be a number of discrepancies between what the test manual says and the actual test materials (Genshaft & Ward, 1982). Gutterman (1985) examined the correlation of the Perkins-Binet with the verbal scale of the WISC-R and an achievement test in a sample of 52 low-vision children, and concluded that the Perkins was not appropriate for these children because it was psychometrically inadequate and lacked reliability and validity.

A Dutch cognitive test. Over the years, a few investigators have attempted to develop standardized tests of cognitive abilities for the visually impaired, as opposed to modifying available tests such as the Stanford-Binet. A group of Dutch investigators (Dekker, Drenth, Zaal, et al., 1990) set out to construct just such a test series for blind and low-vision children, aged 6 to 15. They used the Primary Factors theory of Thurstone (1938), which identified seven primary factors or components of intelligence: verbal comprehension, memory, numerical fluency, verbal fluency, reasoning, perceptual speed, and spatial ability. For each of these dimensions (except for numerical fluency), they either borrowed a test from a standardized instrument (e.g., the WISC-R vocabulary subtest as a measure of verbal comprehension) or developed a new test. An example of the latter is the Perception of Objects subtest, designed to measure perceptual speed and consisting of a series of five small three-dimensional objects, such as buttons, which are presented to the child. The object on the far left is identical to one of the four others, and the child is asked to identify which.

The total test battery consisted of 13 of these subtests, covering both verbal abilities and “haptic” (i.e., active tactual perception) abilities. Because the battery is intended for use with children who are totally blind as well as with children who have low vision, the authors used dark or light gray test materials to minimize any vision a child may have. As part of the test battery, they used the Dot test, using 10 cards, each of which has a number of dots, from 0 to 5, printed on it. The child is asked to indicate the number of dots on each card; one or more correct responses would indicate that the child has some usable vision.

The investigators studied this battery with 106 Dutch and Belgian children without usable vision and 49 with usable vision. They report considerable data, including the fact that for both verbal and haptic subtests the mean difficulty level was .50, with alpha reliability coefficients ranging from .76 to .94, with a median value of .86, and no evidence of item bias. Correlations between total scores and various measures of academic achievement were typically in the .40s and low .50s, comparable to the coefficients reported in studies of normally sighted children.
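In classical test theory, an item’s “difficulty” is simply the proportion of examinees who pass it, so a mean difficulty level of .50 means that the average item was passed by half the sample. A minimal sketch with an invented response matrix:

    # Item difficulty = proportion passing each item (invented 0/1 responses).
    responses = [  # rows = children, columns = items
        [1, 0, 1, 1],
        [1, 1, 0, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1],
    ]
    n = len(responses)
    difficulties = [sum(item) / n for item in zip(*responses)]
    print(difficulties)                           # [0.75, 0.5, 0.5, 0.5]
    print(sum(difficulties) / len(difficulties))  # mean difficulty 0.5625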
Tactual forms. Another approach is to use tactual forms of standardized tests; that is, paper-and-pencil type items are changed into concrete objects or embossed items that can be physically felt. For example, Rich and Anderson (1965) developed a tactual form of the Raven’s Coloured Progressive Matrices (see Chapter 5); they reported high reliability and moderate validity coefficients in a group of 115 blind children. Another example is the study by G. Domino (1968), who developed a tactual version of the D-48 (see Chapter 5), utilizing commercially available three-dimensional dominoes. Each D-48 item was then presented individually to 30 male adults, aged 20 to 46, all totally blind from birth, following the same sequence as the standard D-48 but with no time limit. The subjects were also administered the verbal portion of the WAIS and an interview that was subsequently rated as to intellectual efficiency, and observer ratings were obtained on two scales measuring motivational and adjustment aspects.
A split-half reliability, with the halves composed of items of comparable difficulty, yielded a Spearman-Brown coefficient of .87. A comparison of the difficulty level of the test items for these blind subjects vs. a sample of sighted children who took the standard form yielded a coefficient of .93, indicating a very close correspondence in the two samples of which items were difficult and which were easy. However, the mean obtained by these blind adults was substantially lower than the means obtained by fifth- and sixth-grade sighted children. The tactual form of the D-48 thus proved to be a difficult task, and not for lack of ability: the sample obtained a mean WAIS Verbal IQ of 108.9.
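The Spearman-Brown correction estimates the reliability of the full-length test from the correlation between its two halves: r(full) = 2r(half) / (1 + r(half)). The study reports only the corrected coefficient, but a half-test correlation of about .77 would produce it:

    # Spearman-Brown correction for a split-half correlation.
    def spearman_brown(r_half):
        return 2 * r_half / (1 + r_half)

    print(round(spearman_brown(0.77), 2))  # 0.87, the coefficient reported above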
The results indicated that scores on the D-48 into five major areas: (1) eye-motor coordination
correlated significantly with WAIS Verbal IQ (r = (assessed by drawing lines between boundaries);
.58), with observers’ ratings on personal adjust- (2) figure-ground (assessed by such items as hid-
ment (r = .39), with interview ratings of intellec- den figures); (3) form constancy (discriminating
tual efficiency(r = .84), with age (r = .39), and specific geometric figures from other shapes); (4)
with educational level (r = .42). Thus, the validity position in space (correctly identifying rotated
of the D-48 with blind adults seems substantial. figures); and (5) spatial relationships (on the
The task, however, proved to be a rather frustrat- Frostig assessed by copying of patterns by linking
ing situation for many clients. From a psycho- dots).
metric point of view, this is a negative aspect, but The Frostig can be administered individu-
from a clinical point of view, the D-48 may well ally or in small groups and requires somewhere
provide a situation where the ability to tolerate between 30 to 60 minutes to administer. The
frustration and to cope can be observed. 1963 standardization sample consisted of Cali-
fornia children aged 4 to 8, with 93% coming
from “middle class” homes, and including very
Perceptual-motor assessment. The assessment few minority children. Thus the sample does not
of perceptual-motor development and abilities is seem to be representative. The test yields three
a complex field, strongly affected by theoretical sets of scores:
viewpoints, as well as the difficulty of relating
research findings to practical applied procedures 1. A perceptual-age score for each subtest and
in the classroom. for the total test is defined in terms of the perfor-
From a theoretical point of view, the study mance of the average child in each age group.
of perception, and particularly visual perception, 2. A scaled score is also available for each sub-
has been dominated by the theoretical positions test, but these scores are not standard scores as
of nativism and of empiricism. The nativist posi- found on tests like the WISC, and their rationale
tion views visual perception as genetically deter- is not indicated; some authors recommend that
mined and basically a passive process in which such scores not be used (e.g., Swanson & Watson,
external stimulation falls on the retina; the image 1989).
is then transmitted to the brain. This point of view 3. A perceptual quotient can also be obtained as
gives little attention to perceptual development a deviation score based on the sum of the subtest
because it is assumed that infants perceive their scale scores corrected for age variation. Because
environment as adults do (Wertheimer, 1958). this quotient is based on the scaled scores, it too
Empiricists consider perception as an acquired is suspect.
ability. The perceptual world of the infant is seen
as unpredictable and changeable; only through Three test-retest reliability studies are reported
learning does the chaos become orderly. Either by the test authors. In two of the studies, the
position does not seem adequate to explain examiners were trained psychologists, and the
perception fully, and a number of attempts obtained test-retest coefficients ranged from .80
have been made to combine the two into a to .98 for the total test score, and from .42 to
In the third study, the examiners were not psychologists; the test-retest correlation coefficients were .69 for the total test score, and from .29 to .74 for the subtest scores. Split-half reliability for the total test score ranged from .89 at 5 and 6 years of age to .78 at 8 and 9 years of age. In general, the reliability of the total test is adequate, but the subtest coefficients are too low to be used for individual differential diagnosis.

A number of factor-analytic studies of the Frostig by itself have reported a single factor rather than the five hypothesized by the test authors (e.g., Corach & Powell, 1963). Other studies that included additional tests, such as an intelligence test, found multiple factors, with various of the Frostig subtests loading on different factors, but either way the results do not support the authors’ initial theoretical framework (e.g., Becker & Sabatino, 1973).

Vision testing. One of the most common devices used to assess vision is the familiar Snellen chart or test. Such charts use rows of letters of gradually diminishing size, the letter E in various positions, a broken circle with the opening in various positions, or other stimuli. Rabideau (1955) reported split-half reliabilities of .98 for the Snellen letter series under carefully controlled conditions; Guion (1965a) indicated that under the nonlaboratory conditions in which the Snellen chart is typically used, its reliability may be as low as .20.

Bibliographies. Various bibliographies of tests that can be used with the visually impaired are available. An example is the book by Scholl and Schnur (1976), which provides a listing and references for a wide variety of measures, from tests of cognitive functioning to measures of personality.

HEARING IMPAIRED

General issues

Prelingual onset. Hearing impairment can cover a wide range of impairment, from mild losses that are barely perceptible to total deafness, and thus the population subsumed by this label is rather complex and defined by multiple criteria. For example, one criterion is degree of hearing loss, ranging from hard of hearing to deaf. Heredity accounts for about 40% to 60% of all childhood deafness; in about half of these cases there are other disabilities as well. Other causes of deafness include rubella (German measles), meningitis (infection of the membranes that surround the brain), and premature birth (Vernon & Andrews, 1990). A crucial variable in the hearing impaired is the age at which the person became deaf. There are those who become deaf prelingually, as infants, before they develop skills in spoken language; as a consequence they do not achieve normal competence in that language. Often the result is not only a deficit in verbal language but in reading, writing, speaking, and hearing. (There are also individuals who are postlingually deaf, prevocationally deaf, and late deafened.)

In the assessment of intelligence, it is important to use nonverbal rather than verbal tests. In fact, nonverbal tests of intelligence typically show that the intelligence of the hearing impaired is normally distributed, while verbal intelligence tests show a decline of about one standard deviation below the norm. In general, the use of verbal language to measure intelligence, personality, aptitudes, or other domains may not be valid for the hearing impaired because the results may well reflect the person’s language limitations. In many hearing-impaired individuals, the picture is also complicated by the presence of additional handicaps that include neurological damage, as well as cultural and attitudinal aspects (Holm, 1987).

Early identification. The early identification of language-related deficits such as hearing impairment is of substantial importance. Not only is language central to most human activities, but it is probably the major aspect of most education and becomes increasingly important as the child progresses developmentally. The gap between the language-impaired child and the normal child widens with age. Early language deficit is also related to subsequent learning and behavioral problems.

Usually, a preschool or kindergarten screening program will use a test or test battery for language assessment, as well as teacher rating scales. There are quite a few such tests; for example, Swanson and Watson (1989) list 91, and the interested reader might wish to consult more specialized sources (e.g., Illerbrun, Haines, & Greenough, 1985; Schetz, 1985).
Very often, the use of different instruments results in different children being identified as needing remedial work. Here the concepts of false positive and false negative are particularly relevant. In using a screening test, to what extent do we tolerate errors of identification? Are we willing to label certain children as having language deficits when in fact they may not? Is it worse not to identify the child who indeed needs that remedial work?
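These questions can be made concrete with a small numerical sketch. The data below are invented: each child has a true status and a screening score, and children scoring below the cutoff are flagged for remedial work. Moving the cutoff trades one kind of error for the other.

```python
# Illustrative only: invented screening scores for children with and
# without a true language deficit (lower scores = poorer language).
deficit_scores = [12, 15, 18, 20, 22]         # children who truly have a deficit
no_deficit_scores = [19, 24, 26, 28, 30, 33]  # children who do not

def screening_errors(cutoff):
    # False negative: a child with a deficit who is NOT flagged.
    fn = sum(score >= cutoff for score in deficit_scores)
    # False positive: a child without a deficit who IS flagged.
    fp = sum(score < cutoff for score in no_deficit_scores)
    return fp, fn

for cutoff in (18, 21, 24):
    fp, fn = screening_errors(cutoff)
    print(f"cutoff {cutoff}: {fp} false positive(s), {fn} false negative(s)")
```

Raising the cutoff misses fewer children who need help (fewer false negatives) but labels more children who do not (more false positives); where to set it is ultimately a value judgment, not a purely statistical one.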
American Sign Language (ASL). In the United States, the native language of the deaf is American Sign Language (ASL), which has grammar and syntax different from English; it can be argued that deaf people are a minority language group, and many of the issues discussed in Chapter 11 are appropriate here.

Just because there is a qualified interpreter, who translates questions that are presented orally into American Sign Language, does not mean that the testing situation has now been made equivalent to that of a hearing person. For one, there needs to be almost constant eye contact between the deaf subject and the gestures being made by the interpreter. Such eye contact can be easily broken by all sorts of distractions. Second, ASL is a vehicle for informal social discourse and not necessarily for formal presentation. Finally, ASL is really a language of its own, rather than representing a strict translation. Thus, if the language is structured so as to conform to English syntax, the result may be an awkward or garbled translation (Hoemann, 1972).

Limitations of agencies and examiners. Quite often, hearing-impaired clients who come to the attention of some agency, perhaps in a school setting or a vocational rehabilitation agency, are sent to a private psychologist for testing. The staff at the agency often lacks either the time or the expertise to do specialized testing; unfortunately, many private psychologists lack testing experience with the hearing impaired. It is not surprising that the literature suggests great caution and limited effectiveness of tests with the hearing impaired (e.g., M. Vernon & D. W. Brown, 1964; Vescovi, 1979; Watson, 1979). Surveys of agencies that serve the hearing impaired, other than school facilities, in fact indicate that very little testing is done, and very few tests are available that have been validated on hearing-impaired samples (E. S. Levine, 1971; 1974; McCrone & Chambers, 1977).

Understanding test instructions. For people who are hearing impaired, probably the most important testing consideration is how well they understand spoken instructions, and how well they speak. In testing hearing-impaired persons, it is important to make sure that they understand the test instructions (M. Vernon, 1970). Although many tests have written instructions, this does not necessarily mean that the hearing-impaired person will understand. Studies have shown that in the population of deaf adults there is a disproportionate number of people with low reading levels (Nester, 1984). Thus, a sign language interpreter may be needed. In fact, in testing the hearing impaired the major difficulty experienced by professionals concerns the tests themselves – problems of administration, interpretation, and the lack of norms (E. S. Levine, 1974).

Prelingually deaf people tend to have a deficiency in verbal language skills, which presents a problem. If the test is one that measures verbal language skills, then the performance of the hearing-impaired person will reflect the degree of deficiency. But if the test is measuring something else, for example, self-esteem, through verbal items, then the result may to an extent reflect the language deficit rather than the variable of interest.

How can these challenges be met? One way is to use nonverbal items, for example, the performance subtests of the WAIS rather than the verbal subtests. The majority of studies on methods to assess the hearing impaired draw attention to the fact that instruments used must have nonverbal, performance-type responses and must also contain nonverbal directions (e.g., Berlinsky, 1952; Goetzinger, Wills, & Dekker, 1967; M. Reed, 1970). Another approach is to rewrite the verbal items in language appropriate for the deaf, i.e., language that reflects the grammatical structure used in sign language; this approach, however, is quite limited. Still a third approach is to translate the test items into sign language by the use of a sign-language interpreter (e.g., Hoemann, 1972). Administration of tests to the hearing impaired should be done by trained examiners, those who have had experience working with the hearing
impaired. Group testing is probably not a valid or useful procedure, except perhaps for screening purposes.

Bragman (1982) found that different methods of conveying test instructions – for example, pantomime vs. demonstration – affect deaf children's performance, at least on pattern-recognition tasks. How generalizable these results are to other testing domains is at present not known.

Lack of norms and appropriate tests. Bragman (1982) listed 22 tests commonly used with the hearing impaired. These tests covered three domains: intelligence (e.g., the Wechsler tests), achievement (e.g., the Stanford Achievement Test), and personality (e.g., the 16 PF). Of the 22, only one, the Hiskey-Nebraska Test of Learning Aptitude, was originally developed for hearing-impaired individuals. For 14 of the 22 tests, modifications were made for use with the hearing impaired, but for only 4 were these modifications standardized, and for only 3 were norms available for the hearing impaired.

Careful use of norms needs to be made. Hearing norms are appropriate if we wish to obtain some knowledge of how the person might function in a competitive environment. Same-group norms are more appropriate if we wish to obtain a profile of the person's strengths and weaknesses.
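The practical difference between the two kinds of norms is easy to show with a hypothetical example: the same raw score yields very different standard (z) scores depending on the normative group against which it is evaluated. The means and standard deviations below are invented purely for illustration.

```python
# Hypothetical example: one raw score, two sets of norms.
def z_score(raw, mean, sd):
    return (raw - mean) / sd

raw = 42  # a client's raw score on some hypothetical test

hearing_mean, hearing_sd = 50, 8        # general (hearing) norms
same_group_mean, same_group_sd = 40, 7  # hearing-impaired (same-group) norms

print(f"vs. hearing norms:    z = {z_score(raw, hearing_mean, hearing_sd):+.2f}")
print(f"vs. same-group norms: z = {z_score(raw, same_group_mean, same_group_sd):+.2f}")
# Against hearing norms the client looks one full SD below average (z = -1.00);
# against same-group norms the same score is slightly above average (z = +0.29).
```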
Modified tests for hearing impaired. Some instruments, particularly performance tests that have simple directions, are used as is with normal hearing norms. Good examples are the performance subtests from the Wechsler or the Raven's Progressive Matrices. Other instruments are also used as is, or with minor modifications, but offer norms based on hearing-impaired samples. With few exceptions, these norms are based on small and atypical samples that often represent the captive clientele of a particular agency. An example is the 16 PF (Trybus, 1973), for which norms of hearing-impaired college students are available. Even with a well-established test such as the 16 PF, the literature suggests that it may not be reliable or valid with the hearing impaired (e.g., Dwyer & Wicenciak, 1977; Jensema, 1975a, b). As with the visually impaired, tests to be used with the hearing impaired often need to be modified, for example, the verbal items from the WAIS (Sachs, Trybus, Koch, & Falberg, 1974) or the Rotter Locus of Control Scale (Koelle & Convey, 1982). Such modifications can take various forms. For example, Leigh, Robins, Welkowitz, et al. (1989) used the Beck Depression Inventory in a study of depression in deaf individuals. They modified the items by rewriting them at a fourth- to fifth-grade vocabulary level. A different example involves the K-ABC, where the child is asked to identify a picture such as that of a typewriter. The typewriter is similar in appearance to a telecommunication device for the deaf (TDD), so a response of TDD was also judged acceptable (Ulissi, Brice, & Gibbins, 1989).

Among the test modifications that have been used are the omission of verbal items, addition of practice items, addition of demonstrations of task strategies (i.e., ways of trying to solve test items), testing the limits, use of additional printed or signed instructions, or the use of pantomime for instructions. The most frequent modification seems to be changes in the test instructions, through pantomime or sign language, for example (Bragman, 1982). Although such modifications can be quite useful both for research purposes and for individual assessment, it must once again be recognized that such modifications can change the meaning of the available norms and of the client's obtained score. Basically, it is important to determine the effect of such modifications – yet little information is available on this topic.

In testing the hearing impaired, more time is often required. There is a need for examples and introductory trials before beginning the actual testing. Quite often test forms, especially those developed on normal samples, do not have these examples. An alternative procedure is to use items from an alternate form of the test as trial items (one more reason why alternate forms of tests can be useful).

In a few cases, tests have been developed specifically for use with the hearing impaired. Sometimes these tests are not readily available because they were developed for use at a specific agency, and sometimes when they are available, they are severely criticized for their inadequacies (e.g., the Geist Picture Interest Inventory; see Farrugia, 1983).

Schemas. Tests sometimes use stories, vignettes, or written material such as a sentence, as test items. The child may be asked to agree or disagree,
to complete a story, to come to some conclusion, or in general to process the material in some way. Some researchers have concluded that to comprehend and recall such material, the child needs to actively construct meaning by applying prior knowledge and experiences rather than passively receive the information (e.g., Chan, Burtis, Scardamalia, et al., 1992). This prior knowledge is stored in memory structures that are called schemas. For example, there is a story schema, an expectation that a story has a setting, an initiating event, a series of episodes, and a resolution. Studies suggest that children who read below their grade level have poorly developed story schemas. There is some suggestion that hearing-impaired children have less well-developed schemas than those of hearing children (Schirmer, 1993). Such findings have repercussions not only for educational practices, such as teaching reading comprehension to deaf children, but also for testing practices.

Why test the hearing impaired? Given all the difficulties and the test limitations, one might well wonder why we make an effort to test the hearing impaired. There are a number of obvious answers, including the need for appropriate educational placement, the identification of skills and aptitudes, and the goal of having impaired individuals participate more fully in mainstream activities. A serious attempt is being made in the United States to identify children with significant hearing impairment during their first year of life. Children who are thus identified and receive early intervention achieve greater receptive- and expressive-language scores than children who receive later or no intervention (Barringer, Strong, Blair, et al., 1993).

One purpose that may not be so self-evident is illustrated by Aplin (1993), who describes a British medical program in which cochlear implants are provided to deaf adults to ameliorate their deafness. A battery of psychological tests is used to assess the suitability of the patient for the implant, as well as to monitor the psychological progress of the patient and to assist when the patient may become disheartened at the perceived lack of progress. Most of the tests used in this program are tests we discussed, including selected scales from the British Ability Scales, the Wechsler, and the Hiskey-Nebraska.

Some Selected Findings

Measures used. What are the tests that are used for testing the hearing impaired? S. Gibbins (1989) conducted a national survey of school psychologists who serve the hearing impaired. Among the most popular measures of cognitive abilities were the WISC-R Performance Scale, the Leiter International Performance Scale, the WAIS-R Performance Scale, the Hiskey-Nebraska Test of Learning Aptitude, the WISC-R Verbal Scale, the Raven's Progressive Matrices, and the Stanford-Binet, in that order. The Vineland Adaptive Behavior Scale was the most commonly used test to assess adaptive behavior, and the Wide Range Achievement Test was the most common test of educational achievement. Figure drawings were among the most common measures of social-emotional status. This use of various psychological tests reflected little change from earlier reports, and probably current usage is much the same.

Projective drawings. Projective drawings are among the most popular assessment techniques used with the hearing impaired. Cates (1991) suggests three reasons for this: (1) they are easy to administer, (2) they are basically a visual-spatial technique and so seem to be appropriate for a population that relies predominantly on visual-spatial cues, and (3) they show consistently high reliability (some would disagree with this point). Unfortunately, there is little evidence of the validity of these measures with the hearing impaired, and most clinicians who use these techniques assume that if they are valid for the normal hearing then they must be valid for the hearing impaired.

Cognitive tests. Early studies of the intelligence of the hearing impaired suggested that as a group, these individuals averaged 10 to 15 IQ points below normal. Later studies indicated that if appropriate nonverbal or performance measures were used, the average scores were comparable with those of the hearing population. At the same time, it has been suggested that there are differences between hearing impaired and normal on the qualitative aspects of intelligence, for example, that the hearing impaired are more concrete and less creative in their thinking.

The need for reliable and valid tests for the evaluation and guidance of hearing-impaired children has long been recognized, with many attempts to modify tests for use with this population. The studies of Pintner and Paterson (1915) with the Binet scale represent one of the earliest attempts to test the intelligence of deaf children. These investigators found that deaf children scored in the mentally retarded range, but attributed this to language deprivation, and developed a nonlanguage test (the Pintner Non-language Test; Pintner, 1924). However, later studies showed that deaf children continued to score significantly below the norm of hearing children (M. Vernon, 1968).

Subsequent research indicated that verbal tests of intelligence were inappropriate, that group tests of intelligence were of questionable validity with the hearing impaired, and that appropriate instruments must have nonverbal, performance-type items as well as nonverbal directions (M. Vernon, 1968). The Stanford-Binet is not generally recommended by experts for use with the hearing impaired, because of its heavy emphasis on language ability.

Braden (1985) showed that the factorial structure of nonverbal intelligence is basically identical in hearing-impaired and normal-hearing subjects, with one major factor g explaining much of the variance. These results suggest that the nonverbal intelligence of the hearing impaired does not differ qualitatively from that of the normal hearing, as some theories have suggested (for a different conclusion, see Zwiebel & Mertens, 1985).

The Wechsler tests and the hearing impaired. The Wechsler Performance Scales are the most popular test for assessing deaf children and adults and seem to demonstrate adequate reliability and validity with deaf subjects (e.g., Braden, 1984; Hirshoren, Hurley, & Kavale, 1979; Lavos, 1962).

There are several reasons why the Wechsler scales are so popular. First, they represent a series of "equivalent" tests, so that a young child may be tested with the WISC and several years later retested with the WAIS. Second, most school counselors and psychologists are trained on the Wechsler and have Wechsler kits readily available, whereas tests such as the Hiskey-Nebraska are not used with normal children and often are not readily available in the school setting.

Braden (1990) did an analysis of the literature and located 21 studies where a version of the Wechsler Performance Scale was administered to deaf subjects and means were given for at least 5 subtests. A meta-analysis of these 21 studies indicated that although deaf persons have mean Performance IQs slightly below normal-hearing norms, they are well within the average range. The only exception occurs on the Coding/Digit Symbol subtest, where the mean score is markedly lower (a mean of 8.77 versus the expected 10). These findings were quite consistent across studies. The significantly lower mean on Coding/Digit Symbol was consistent across gender, degree of hearing loss, and version of the Wechsler used, but was increased by the type of administration procedure used (e.g., standard vs. oral administration), the presence of additional handicapping conditions, and the use of norms based on deaf persons. Several explanations for this finding have been proposed, including the failure of the person to understand the speeded nature of the task, underdeveloped language skills needed to mediate the task, the prevalence of neuropsychological deficits among deaf persons, and even test bias (Braden, 1990).
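To put this deficit in perspective: Wechsler subtest scaled scores are standardized to a mean of 10 and a standard deviation of 3, so the reported mean of 8.77 corresponds to

\[ z = \frac{8.77 - 10}{3} \approx -0.41, \]

that is, roughly four tenths of a standard deviation below the normative mean – a consistent but fairly modest depression of the score.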
In discussing the Wechsler tests in Chapter 5, we introduced the concept of subtest variability. At a theoretical level, we would expect a person to perform equally well, or equally poorly, across all subtests, but in fact we typically obtain variability, i.e., a person might do better on some subtests and less well on others. We saw that there have been attempts to relate such variability to diagnostic status. Hearing-impaired individuals show a typical group profile. They perform relatively well on the Picture Completion subtest, suggesting above-average ability to discern essential and unessential details. They perform somewhat below average on the Picture Arrangement subtest, which requires the subject to correctly place in order a sequence of "cartoon" panels to make a story. Whether this reflects poorer social skills based on more restricted social experiences, or poorer sequential skills that reflect the lack of continuous auditory stimulation, is debatable. The hearing impaired perform above average on the Block Design subtest, where a design is to be reproduced using wooden blocks, and on the Object Assembly subtest, essentially a series of puzzles. They do less well on the Coding subtest, which requires a number of skills. Whether this
"typical" profile reflects more "concrete" abilities and a lower capacity to deal with abstract material is a moot question (D. W. Hess, 1969).

A typical study is that by D. R. Ross (1970), who administered the WAIS to deaf students aged 16 to 21, by both manual and verbal communication. The mean Performance IQ was 106, but the mean Verbal IQ was 72.

It should be noted that, although there are many articles on the WISC-R as applied to deaf children, there are relatively few studies of the WAIS-R with hearing-impaired adolescents or adults (Ensor & Phelps, 1989).

The WISC-R and deaf children. The WISC-R Performance Scale (PS) is probably the intelligence test used most often by psychologists assessing deaf children. Despite this, there are actually few studies of the validity of the WISC-R PS with deaf children. Hirshoren, Hurley, & Hunt (1977) administered the PS and the Hiskey-Nebraska to a sample of 59 prelingually deaf children, average age about 10. The average IQ on the WISC-R PS was 88, with IQs ranging from 52 to 129. The intercorrelations among PS subtests were quite similar to those obtained with hearing children. Scores on the PS correlated .89 with the Hiskey-Nebraska learning quotient scores; this and other analyses indicate the two tests are fairly interchangeable. Braden (1989) compared the WISC-R PS scores for a sample of 33 prelingually, severely deaf children with their scores on the Stanford Achievement Test (hearing-impaired edition). The correlations between the two sets of scores were rather low, ranging from a high of .37 to a low of −.08, with a median r of about .14. In a second study, a number of nonverbal tests of intelligence (such as the Hiskey-Nebraska) were also used. Again, the correlation between nonverbal IQ and achievement was rather low. The author questions the criterion validity of the WISC-R PS with deaf children, but also points out that academic achievement may not be an appropriate criterion measure for nonverbal IQs; each involves different psychological processes, and academic achievement is attenuated (i.e., lowered) by hearing loss.

On the other hand, the WISC-R is one of the few standardized tests for which there is a standardized version, of the performance subscales only, for deaf children, based on a sample of more than 1,200 deaf children (R. D. Anderson & Sisco, 1977). Older studies (e.g., E. E. Graham & Shapiro, 1963) administered the performance scale of the WISC by pantomime, with positive results.

Culture-fair tests. Culture-fair tests, and particularly the Raven Progressive Matrices (see Chapter 5), have been used quite frequently to assess the intellectual level of deaf persons, with results that correlate relatively well with Performance IQs on the Wechsler (e.g., Naglieri & Welch, 1991).

The Peabody Picture Vocabulary Test (PPVT). Another cognitive test that has been useful with the hearing impaired is the PPVT (see Chapter 9), which consists of a series of plates with drawings on them. A word is given by the examiner and the child identifies the corresponding drawing. Forde (1977) describes its use at a Canadian school as a screening device, with two modifications: where necessary, the printed vocabulary word was used rather than the spoken one, and rather than stopping testing after six out of eight errors, it was up to the examiner to determine when the test should be discontinued. For a sample of hearing-impaired students tested during a 6-year period (N = 196), the average IQ was 104 (SD = 15). Correlations of the PPVT with scores on the Stanford Achievement Tests ranged from a low of .49 (with Arithmetic Computation) to a high of .70 (with Language).

Stanford Achievement Test-Special Edition. In 1974, the Office of Demographic Studies of Gallaudet College (a famous college for the hearing impaired) developed a special edition of the 1973 Stanford Achievement Test for use with hearing-impaired students (SAT-HI; Trybus & Karchmer, 1977). This special edition was standardized nationally on a stratified random sample of almost 7,000 hearing-impaired children and adolescents. Like the Stanford, this form is a full-range achievement test, composed of six different levels or batteries – from the primary level through advanced levels, covering grades 1 to 9. There are four core-subject areas: vocabulary, reading comprehension, mathematics concepts, and mathematics computation.

The SAT-HI is a group test to be administered in the classroom using whatever method of communication is normally used. Practice tests are
provided for each level and subtest. The test is designed for ages 8 to 21, and the norms are based on students in 119 special-education programs throughout the United States.

The results of nationwide applications of this special edition show that for hearing-impaired students aged 20 or above, the median reading score is equivalent to a grade norm of 4.5; that is, half of hearing-impaired students can barely read at a newspaper literacy level. For math computation, the results are somewhat better, with the median score equivalent to just below the eighth-grade level. As the authors indicate, the overwhelming majority of hearing-impaired children leave school in their late teens at a very substantial educational disadvantage compared with their normal-hearing peers (Trybus & Karchmer, 1977).

The average test-retest correlation coefficient is about .83, and the standard error of measurement is about 3. The test-retest reliability of this special edition was assessed over a 5-year period for a national sample of hearing-impaired students (Wolk & Zieziula, 1985). Despite this rather extended time period, the results indicated substantial stability over time. Interestingly, the test-retest coefficients were consistently lowest for black hearing-impaired students. Because the test items of the SAT-HI are identical to those of the Stanford, it is assumed that the validity of the Stanford (which is generally excellent) also applies to the SAT-HI.
The SAT and the hearing impaired. Ragosta and Nemceff (1982) studied the SAT performance of hearing-impaired students who had taken the nonstandard administration of the SAT. As a group, their means were between .5 and 1.2 standard deviations below the means of hearing students, with verbal scale performance more discrepant than mathematical scores. D. H. Jones and Ragosta (1982) investigated the validity of the SAT vs. the criterion of first-year college grades. For a sample of deaf students attending a California state university, the SAT verbal correlated .14 with grades, while the SAT mathematical correlated .41; for hearing students the coefficients were .38 and .32, respectively.

Self-concept. In Chapter 8, we discussed self-concept and the Tennessee Self-Concept Scale (TSCS; Fitts, 1965). Self-concept is basically the sum total of the perceptions an individual has about him- or herself. A positive but realistic self-concept seems to be associated with optimal development. Self-concept is an important variable for disabled individuals as well, with a number of studies looking at self-concept and its ramifications for individuals with specific disabilities such as hearing impairment (e.g., H. B. Craig, 1965; Farrugia & Austin, 1980; Loeb & Sarigiani, 1986; Yachnick, 1986). Typical findings of lowered self-esteem in the hearing impaired may, in part, be a function of the instruments used, because such instruments are typically developed on normal samples.

There is a wide variety of self-concept measures, most useful with individuals who have disabilities. At the same time, there are at least two concerns: (1) the reading level required; and (2) the use of inappropriate items in relation to a particular disability (e.g., "I can hear as well as most other people"). If these concerns seem strong enough, then there are two alternatives: to revise existing instruments, which would mean determining the reliability and validity of the revised form (see Jensema, 1975a, for an example), or to create a new instrument specifically designed for a particular disabled population.

Gibson-Harman and Austin (1985) revised the TSCS by simplifying the language structure of the items, shortening sentences, and lowering the vocabulary level required. Of the 100 TSCS items, 79 were thus changed. These investigators then studied three samples of individuals: normal-hearing persons, deaf persons, and hard-of-hearing persons. The normal-hearing persons were administered the original TSCS followed by the revised TSCS some 2 to 4 weeks later (one half received the revised form first and then the original). The deaf and hard-of-hearing persons were administered the revised form and retested some 2 to 4 weeks later. For the normal-hearing group, the correlation of total scores between original and revised forms was .85, indicating that the two forms are fairly equivalent. For the deaf and hard-of-hearing samples, the test-retest correlations were .76 and .89, indicative of adequate test-retest reliability.

Oblowitz, Green, and Heyns (1991) chose to develop a new instrument specifically for the hearing impaired. The Self-Concept Scale for the
Hearing Impaired (SSHI) is a 40-item Likert-type self-report scale. The items were selected from a pool of 80 statements that had been written on the basis of a literature review, with a number of items adapted from existing scales. The basis for selection was primarily a logical, judgmental one, rather than one based on statistical criteria. Each item consists of three pictures of young people, identical except for their facial and bodily expressions; the first picture portrays a happy expression; the second, a neutral expression; and the third, a sad expression, with the order of the three expressions differing from item to item. Each of the drawings has a statement in cartoon-like fashion, such as, "I do not want to wear my hearing aid." There are two forms of the questionnaire, one with male drawings and one with female drawings. The 40 items cover 4 areas of self-concept: personal, physical, academic, and social, although the distribution of items is not equivalent: there are only 4 items that cover personal self-concept, but there are 24 items that cover social self-concept. Items that differ only slightly in meaning are included to check for consistency of response, although the number and nature of such items is not indicated. Students mark the picture in each triad that best represents their response choice. The SSHI is not timed, with an average completion time of about 30 minutes. Scoring uses a 3-point scale, with 3 points assigned to positive items, 2 points to neutral, and 1 to negative items. You might wish to review our discussion of nominal, ordinal, interval, and ratio scales (Chapter 1), and consider the rationality of this scoring scheme. A total score can be computed on the SSHI, as well as scores for the four subareas.
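A minimal sketch of this scoring scheme follows. The item-to-subarea assignments and the responses below are hypothetical, and the measurement issue just raised is visible in the code: the 3-2-1 weights treat what are really ordered categories as if they were equally spaced points on an interval scale.

```python
# Hypothetical sketch of SSHI-style scoring: each response is the picture
# chosen from a triad, worth 3 (positive), 2 (neutral), or 1 (negative).
POINTS = {"positive": 3, "neutral": 2, "negative": 1}

# One student's (hypothetical) choices on six items, tagged by subarea.
responses = [
    ("social", "positive"), ("social", "neutral"), ("academic", "negative"),
    ("physical", "positive"), ("personal", "neutral"), ("social", "positive"),
]

subarea_totals = {}
for area, choice in responses:
    subarea_totals[area] = subarea_totals.get(area, 0) + POINTS[choice]
total_score = sum(subarea_totals.values())

print(subarea_totals)        # e.g., {'social': 8, 'academic': 1, ...}
print("total:", total_score)
# Summing assumes the negative-to-neutral step equals the neutral-to-positive
# step -- an interval-scale assumption imposed on ordinal response categories.
```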
The SSHI was administered to 253 hearing-impaired children at three different special schools, with children ranging in age from 11 to 19. Test-retest reliability coefficients over an interval of about 1 month were .70 for the total scale and from .49 to .68 for the 4 subareas. The reliability seemed to be higher for those children whose hearing loss was minimal, as well as for those children whose communication skills were good, with no gender differences.

Correlations between scores on the SSHI and ratings by professionals on the same dimensions were relatively poor, ranging from a low of .08 for the personal dimension to a high of .32 for the total score. In particular, when ratings were made by school teachers rather than by professionals experienced in making such judgments, the resulting coefficients were lower. Other approaches, such as administering the SSHI to a group of normal-hearing adolescents together with other self-concept scales, yielded more promising results. A factor analysis yielded 10 factors that the authors interpreted as "corresponding reasonably well to the dimensions of the self-concept" included in this test. Incidentally, the authors of this test are from Italy and South Africa, but no mention is made of whether the SSHI was developed in the United States or where the subjects came from. Although more efforts like this need to be made, the SSHI illustrates the difficulties of doing good research and some of the challenges associated with the development of tests for special populations.

Personality functioning. The literature suggests that there are a number of personality characteristics associated with hearing impairment, such as neurotic tendencies, excessive anxiety, social withdrawal, lack of sociability, depression, suspiciousness, social immaturity, and emotional instability. To what degree these findings reflect attempts to cope with the deafness, realistic repercussions of being hearing impaired, or limitations in the instruments used, is an unresolved issue.

Paper-and-pencil inventories such as the MMPI and the 16 PF require a fairly high reading level and also use idiomatic expressions that hearing-impaired persons may not understand. However, Brauer (1993) translated the MMPI into American Sign Language (ASL) and reported some basic data showing the linguistic equivalence of the ASL and English versions of the MMPI and adequate reliability, but no validity data. Rosen (1967) showed that hearing-impaired persons whose academic achievement scores were sufficient to understand MMPI items in fact did not understand many of these items due to their idiomatic nature; in addition, some of the MMPI test items are not appropriate for hearing-impaired persons. Thus, a number of researchers have turned to projective tests. Many of these projective techniques are of limited use because they emphasize language and communication, as in
the Rorschach and the TAT. Drawing tasks are used quite often, but psychometric issues abound with these measures. Unfortunately, much of the research also is rather flawed.

A typical example is the study by Ouellette (1988), who administered the House-Tree-Person (HTP; Buck, 1966) to a sample of 33 severely hearing-impaired young adults. The HTP requires the subject to draw pictures of a person, a person of the opposite gender, a house, and a tree. The test was administered using both sign language and voice. Three psychologists were then asked to rate each subject, on the basis of the drawings, on eight personality traits culled from the literature as particularly applicable to the deaf, namely aggression, anxiety, dependency, egocentricity, feelings of inadequacy, immaturity, impulsivity, and insecurity. The psychologists' ratings were then compared with counselors' ratings on the same eight dimensions, except that the counselors' ratings were based on direct knowledge of the clients. The first question asked concerned interrater reliability, that is, how well did the three psychologists agree with each other? The answer, unfortunately, is "not very well." The 24 correlation coefficients that were computed ranged from a high of .72 to a low of −.14, with a median of about .32. Although the author argues that for four of the scales there was adequate interrater reliability because the correlation coefficients were statistically significant, for none of the scales were the coefficients consistently above .70.

The author of the study then compared the psychologists' ratings to the counselors' ratings by using t tests and found no significant differences in mean ratings for five of the eight dimensions. However, this was not the appropriate analysis. Two individuals could rate a group of subjects in drastically different ways, yet still give the same average rating. What would have been appropriate is the correlation coefficient.
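A tiny invented example shows why. Below, two raters assign the same mean rating to a group of five subjects while disagreeing about every individual; a t test on the means would find nothing amiss, whereas the correlation reveals complete disagreement.

```python
# Invented ratings: same mean, opposite orderings.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rater_a = [1, 2, 3, 4, 5]
rater_b = [5, 4, 3, 2, 1]  # the same five subjects, ranked in reverse

print(sum(rater_a) / 5, sum(rater_b) / 5)  # means: 3.0 and 3.0 -- identical
print(pearson_r(rater_a, rater_b))         # r = -1.0 -- total disagreement
```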
Creativity. Hearing-impaired persons have often been described in the literature as rigid and concrete, lacking imagination, and having limited abstract and divergent thinking (Myklebust, 1964). Recent research suggests that some of these findings are more reflective of the linguistic limitations of hearing-impaired children than a reflection of their abilities. Nevertheless, few studies have looked at the creativity of hearing-impaired children, and fewer still at strategies for developing creative abilities in such individuals. Laughton (1988) studied 28 profoundly hearing-impaired children aged 8 to 10. All of the children were given the figural nonverbal part of the Torrance Tests of Creative Thinking (see Chapter 8), and then half of the children were exposed to a curriculum designed to enhance their creativity, while the other half met for a traditional art class. The results indicated significant improvement in flexibility and originality for the experimental group.

The Vineland Adaptive Behavior Scale. You will recall from Chapter 9 that this test is a revision of the Vineland Social Maturity Scale. Dunlap and Sands (1990) administered the Vineland to 118 hearing-impaired persons with an average age of 20.7 years. On the basis of a cluster analysis (like a factor analysis), three subgroups of subjects were identified, having low, middle, or high scores on the Vineland. Additional analyses indicated that the three groups differed in communication skills, daily-living skills, and degree of socialization. It is interesting to note that the "low" group did not have the most severe hearing losses, as might be expected, but did have more members with physical impairments and with low IQ scores. The authors suggest that a classificatory scheme based on functional ability may be more useful than one based on hearing loss. Other adaptive and behavioral scales that can be quite useful for the hearing impaired are the American Association of Mental Deficiency Adaptive Behavior Scale (1974) and the Cain-Levine Social Competency Scale (Cain, Levine, & Elzey, 1977), both of which were standardized on the mentally retarded.

Behavioral rating scales. These scales, such as the Behavioral Problem Checklist (H. C. Quay & Peterson, 1967), can be quite useful, especially when there are norms available based on hearing-impaired samples. The value of these scales lies in part in the fact that they reflect the evaluation of an observer who typically has had extensive contact with the client, and they force the observer to report his or her observations in a standardized manner.

Test guides. A number of test guides listing tests that are suitable for the hearing impaired are also available. One example is by Zieziula (1982), which covers a wide variety of tests, from academic achievement measures to work evaluation systems. Often these guides are superficial, and they become rapidly outdated.

PHYSICAL-MOTOR DISABILITIES

Nature of physical disabilities. From a psychometric perspective, physical disabilities present a much more heterogeneous set of conditions, some requiring no particular modifications in test or test administration and others presenting substantial challenges. Even within a particular disability, clients may differ dramatically from each other in their test-taking capabilities.

Three of the major categories of physical disabilities that can present challenges in the testing situation are those due to neuromuscular diseases, major physical injuries, and severe chronic health problems. Neuromuscular diseases include conditions such as cerebral palsy and muscular dystrophy. These often involve troublesome involuntary movements, clumsy voluntary movements, impaired mobility, and sometimes evidence of brain injury as reflected in impairments in verbal skills and in motor coordination. Physical injuries can also be quite varied and may involve paralysis due to spinal-cord injury or orthopedic disabilities, such as injuries to a limb. Finally, chronic health problems can range from cancer to severe allergies to conditions such as diabetes and asthma.

Motor impairment. In this chapter, we include motor impairments under the more generic label of physical disabilities. Motor disabilities refer to impairment in moving parts of the body, such as the hands or arms, and cover a wide variety of conditions such as cerebral palsy and quadriplegia. What test modifications are needed is, of course, a function of the specific motor impairments. For some clients, no changes from standardized procedures are necessary. Considerations needed by others follow next.

Some considerations. There are three major areas of concern regarding the testing of physically disabled individuals: psychological, physical, and psychometric (K. O. White, 1978).

Psychological considerations include sensitivity to the fact that some disabled individuals have limited opportunities for social interactions; for these individuals, the testing situation may be frightening and strange, and their test anxiety may well cloud the results.

Physical considerations involve the verbal contents of a test and the performance requirements. The content of some items may be inappropriate for disabled individuals – for example, a personality test item that reads, "I am in excellent health." The test itself may be inappropriate in that it may require speed of performance or other aspects not in keeping with the client's coordination, strength, or stamina. When standard instruments are used with disabled individuals, often they are modified to meet the requirements of the specific situation. There may be a need for a comfortable work space designed to accommodate a wheelchair, or assistance with some of the manual tasks involved in a test, such as turning the pages of the test booklet; indicating responses by gestures, pointing, or pantomime; perhaps longer and/or more frequent test breaks; or extra time. Speed tests may also be inappropriate.

Psychometric considerations involve the impact that modifications of administration have on both the normative aspects of the test (e.g., does the same raw score obtained under a time limit vs. untimed mean the same?) and on the psychometric aspects, such as validity.

The client is the focus. Most experts agree that the examining procedures for physically disabled individuals need to be modified to reduce the physical barriers that can interfere with appropriate testing. Keep in mind that the aim of testing is typically to obtain a picture of what the individual can do. The examiner should be aware of the nature and extent of the client's disabilities, and the client should be contacted prior to testing to determine what testing modifications, if any, are required. As always, rapport is important. Patience, calmness, adaptability, and tact are desirable examiner characteristics, as are eye contact, emphasis on ability as opposed to disability, and acceptance of the client as an independent, fully functioning individual (K. O. White, 1978).

Disabilities and personality. Years ago, some efforts were made to determine whether there were specific personality types associated with various disabilities. Mental health professionals often spoke of the "ulcer personality," the "tuberculosis personality," and even the "multiple sclerotic personality." In fact, the evidence suggests that no such personality types exist (e.g., A. H. Canter, 1952; Harrower & Herrmann, 1953; Linde & Patterson, 1958).

SAT and GRE as examples. Developing fair and valid tests for individuals who are physically impaired poses unique and complex measurement questions. Some of these issues can be illustrated by the Scholastic Aptitude Test (SAT), used in college admissions, and the Graduate Record Examination (GRE), used in graduate-school admissions.

A major goal of an admissions-testing program is to provide standardized assessment of scholastic ability and achievement that is objective and fair for all applicants. But how are we to test those who are disabled? A standard test and the same test in braille, as we have seen, require different sensory skills and thus may not be equivalent. Some individuals may have disabilities whose effects cannot be distinguished from the abilities and skills that a test attempts to measure – for example, how do we separate reading disability from reading comprehension?

As mentioned above, researchers at ETS have looked at the extensive data available on the nonstandard versions of the SAT and the GRE. In general, the results indicate that the nonstandard versions of the SAT and GRE appear to be generally comparable with the standard tests with respect to reliability, factor structure, and how items function. As to predicting academic performance, for example in college, it turns out that the academic performance of physically impaired students tends to be somewhat less predictable than that of normal students. In addition, the nonstandard SAT and GRE were not comparable with the standard versions with respect to timing – i.e., disabled examinees were more likely to finish the test than their nondisabled peers, and some test items near the end of the test were relatively easier for disabled students. These results, of course, reflect the fact that nonstandard versions of the SAT and GRE basically give the disabled candidate as much time as needed.

Spinal-cord injury. There are an estimated 250,000 people with spinal-cord injury in the United States, and an additional 8,000 sustain such an injury each year (Trieschmann, 1988). There are thus a number of studies on this population, using measures of such variables as adjustment (e.g., Hanson, Buckelew, Hewett, & O'Neal, 1993), depression (e.g., Tate, Forchheiner, Maynard, et al., 1993), and employment (e.g., McShane & Karp, 1993). A typical study is that of Krause and Dawis (1992), who administered the Life Situation Questionnaire (LSQ) to a sample of 286 persons who had suffered a spinal-cord injury. These individuals were basically middle-aged (mean of 41.9 years) but had suffered the injury as young adults (mean of 23.4 years at injury). Most (81%) were male, and 61% were quadriplegic.

The LSQ was developed specifically to measure mostly objective information on a broad range of areas relevant to persons with spinal-cord injury. Items include asking participants to indicate the number of weekly visitors, the number of nonroutine doctor visits, overall self-assessed adjustment, and degree of satisfaction in various areas of one's life.

The Ostomy Adjustment Scale. Some instruments, such as the Wechsler tests or the MMPI, are designed for a rather broad segment of the population – for example, all "normal" individuals who are at least 16, or all psychiatric patients. Nonetheless, they can be quite useful with specific populations, such as the physically disabled. Some instruments, however, are designed specifically for a target population, a somewhat narrower segment; an example of this is the Ostomy Adjustment Scale (Olbrisch, 1983).

More than 1.5 million persons in the United States and Canada have undergone ostomy surgery, with approximately 110,000 new ostomies performed each year. A stoma is a passageway that is surgically constructed through the abdominal wall as an exit for body waste. There are several specialized procedures here, including a colostomy, which is a rerouting of the large intestine, often performed with older persons who have been diagnosed with colorectal cancer. One of the key issues in recovery from this surgical procedure is the patient's emotional adjustment and acceptance of the stoma.

Olbrisch (1983) attempted to develop a reliable and valid measure of adjustment to ostomy surgery, to evaluate that process of adjustment and the effectiveness of mutual aid groups for ostomy patients. The first step was to generate a pool of potential items based on a review of the literature, as well as the contributions of three ostomy patients and three expert professionals. From this pool of items, 39 were selected that reflected a wide range of situations, applicability to most potential respondents, and readability. The items were worded so they could be responded to on a 6-point Likert scale (e.g., "I can lead a productive and fulfilling life despite my ostomy," and "I feel embarrassed by my ostomy, as though it were something to hide"). The scale, together with several other instruments, was mailed to a sample of 120 ostomy patients; of these, 53 returned usable questionnaires. These patients ranged in age from 19 to 83, included 29 males and 24 females, and had an average time since surgery of about 2 1/2 years.

Five of the items were eliminated because of low item-total correlations or low variances, suggesting that the items were being misinterpreted or did not discriminate among participants. For the 34 remaining items, the Cronbach alpha was .87, indicating high internal consistency. A test-retest analysis with an interval ranging from 2 to 6 weeks yielded an r of .72.

Discriminant validity was established by showing that scores on the scale were not significantly correlated with measures of social desirability and measures of self-esteem. Convergent validity was shown by small but significant correlations with such variables as the number of months elapsed since surgery, whether the surgery had been elective or emergency, and whether the patient was able to work or not.

An exploratory factor analysis yielded 12 factors, with the first 5 factors accounting for 69% of the variance. These factors cover such dimensions as normal functioning and negative affect. Based on the factor analysis and other considerations, the author divided the items into two alternate forms of 17 items each.

SUMMARY

In this chapter, we looked briefly at three types of disabilities: visual impairment, hearing impairment, and physical impairment. The overall picture is that tests can be very useful with these individuals, but great care needs to be exercised because quite often tests need to be modified in ways that may significantly alter the test results. The evidence suggests that such modified tests are comparable with the original format and that reliability can be quite satisfactory, but the validity in most cases has not been investigated adequately.

SUGGESTED READINGS

Fabiano, R. J., & Goran, D. A. (1992). A principal component analysis of the Katz Adjustment Scale in a traumatic brain injury rehabilitation sample. Rehabilitation Psychology, 37, 75–86.

This is a prototypical study that takes a scale originally developed for one population and asks whether that scale is applicable and useful with another population. Here the authors take an adjustment scale and explore its utility with patients who were participating in a rehabilitation program following traumatic brain injury.

Freeman, S. T. (1989). Cultural and linguistic bias in mental health evaluations of deaf people. Rehabilitation Psychology, 34, 51–63.

This article discusses what the author calls a "cultural minority" that is distinguished by its language – namely, the deaf. An interesting article that focuses on psychological testing of the deaf.

Head, D. N., Bradley, R. H., & Rock, S. L. (1990). Use of home-environment measures with visually impaired children. Journal of Visual Impairment and Blindness, 84, 377–380.

A brief article that focuses on the assessment of the home environment of visually impaired children. Although the authors do not discuss specific measures in any detail, contrary to what they state in the abstract, they make some interesting observations about the topic.

Morgan, S. (1988). Diagnostic assessment of autism: A review of objective scales. Journal of Psychoeducational Assessment, 6, 139–151.

Although the focus of this chapter was on "physical" disabilities, there are a number of other diagnostic entities that could have been included: autism is a good example and the focus of this article. This is a review of five scales for the diagnosis of autism, and it is a good example of many articles that review a well-defined set of scales for a specific population.

Nester, M. A. (1993). Psychometric testing and reasonable accommodation for persons with disabilities. Rehabilitation Psychology, 38, 75–85.

An excellent article that covers the legal and psychometric issues related to nondiscriminative testing of persons with disabilities. The author is a psychometrician with the United States Office of Personnel Management who has written extensively on the topic.

DISCUSSION QUESTIONS

1. What are some of the basic issues involved in testing disabled clients?
2. In interpreting the test score of a disabled individual, we can use general norms based on a "random" sample or selected norms based on specific samples (e.g., blind college students). Which is more meaningful? Is this situation different from using ethnic-specific norms (e.g., based on black students only)?
3. Compare and contrast testing the visually impaired vs. the hearing impaired.
4. How might the Semantic Differential (discussed in Chapter 6) be used with physically disabled individuals? (For an example, see Thomas, Wiesner, & Davis, 1982.)
5. One of the conclusions of this chapter is that the validity of most measures has not been investigated adequately. How can this be remedied?

PART FOUR: THE SETTINGS
13 Testing in the Schools
AIM Because much testing occurs in a school setting, this chapter looks at testing
in the context of school, from the primary grades through professional training. For
each level, we look at a representative test or test battery, as illustrative of some of the
issues, concerns, and purposes of testing. The intent here is not to be comprehensive,
but to use a variety of measures to illustrate some basic issues (see R. L. Linn, 1986).
PRESCHOOL ASSESSMENT

At least in the United States, testing in the schools is quite prevalent. Tests are used for accountability, for instructional improvement, and for program evaluation, as well as for individual student diagnosis and/or placement, advancement, and graduation determinants. In any one year, more than one third of all children (about 14 to 15 million) are tested, with about 70% of the tests using multiple-choice items (Barton & Coley, 1994).

Entrance into school represents a major transition point for most children in the United States and in most other cultures. The transition is often facilitated by a variety of preschool programs. Testing can provide a partial answer to a number of key questions, such as the readiness of the child to enter school, the identification or diagnosis of conditions that may present special educational challenges, and the assessment of a child's abilities and deficiencies. Recently in the United States, there has been a marked trend to evaluate children as early as possible in order to plan educational interventions and remediation.

Objectives of Preschool Assessment

Preschool assessment involves a variety of efforts, from comprehensive developmental assessment to the screening of high-risk children. The general objective of assessment in educational settings is to make appropriate decisions about children that will facilitate their educational and psychological development (Paget & Nagle, 1986). Among the various purposes for testing preschool children might be: (1) screening of children at risk – here the concepts of false positive and false negative are particularly relevant; (2) diagnostic assessment to determine the presence or absence of a particular condition, often for the purpose of establishing eligibility for placement in a special program, as well as to formulate intervention and treatment recommendations; and (3) program evaluation, where the test results are used to document and evaluate specific programs.

Neisworth and Bagnato (1986) indicate that decisions based on assessment typically include diagnosis, i.e., assignment to a clinical category; prognosis, i.e., projection of status; and program planning. Succinctly, testing can be used to place, to predict, or to prescribe.

In the past, the focus has been on diagnostic testing. However, it is now recognized that this approach is both difficult and unproductive at the preschool level because the young child has not yet developed stable behavior, shows intraindividual variance, is difficult to test, and changes rapidly. Some authors argue that assessment
These authors advocate a "test-teach-test" approach, where test items become instructional objectives and vice versa. They urge the use of curriculum-based assessment measures, which are basically criterion-referenced tests that use the curricular items themselves as the assessment content.

Some general problems. Testing preschoolers can represent quite a challenge. Most preschool children cannot read, so written self-report measures, which probably represent the most common testing approach, cannot be used. Their verbal and visual-motor response capabilities are also restricted – thus a preschool child may be unable to tell a story in response to pictures. Similarly, their information-processing skills may be quite limited, and their responses to questions may reflect such limitations. Preschool children may not be familiar with a "testing" situation and may be fearful and apprehensive. Preschool children also have a relative inability to understand the demand characteristics of the testing situation, and may not understand the need to be motivated in answering "test" questions. They may find the smiles of the examiner not particularly reinforcing, and it may be difficult to assess whether the child lacks the ability to answer correctly or does not wish to cooperate (D. Martin, 1986).

Assessment approaches. There seem to be five major approaches within the tradition of psychometric testing:

1. Interviews of the parents and teachers are probably the most widely used method to assess the social-emotional functioning of the preschool child. Interviews, particularly of the child directly, are quite limited from a psychometric point of view, often yielding low reliability and validity.

2. Direct behavioral observation is considered to be one of the most valuable assessment methods for young children. In part, this is because young children are ordinarily not bothered by observation as are older children, and in part because of their limited verbal repertoire. When such observation is done systematically, the interobserver reliability can be substantial (R. P. Martin, 1986), and a number of such observation systems, in which behavior can be coded, have been developed.

3. Rating scales, filled out by the parent and/or teacher, are relatively inexpensive and require little time to complete and to score. These scales are limited by three types of errors that produce unwanted variation in scores, or error variance. The first is interrater variance – different people filling out a rating scale for the same child will often give different ratings. Usually, this is not a reflection of the poor reliability of the scale, but rather reflects the different perspectives that different people have. A second source of error is setting variance. The parent sees the child at home, while the teacher sees the child at school. These different settings may elicit different behaviors. Finally, there is temporal variance, which reflects the effect of taking a measure at one time as opposed to another. (D. Martin, 1986, gives a brief review of 12 such rating scales, including the Conners' Parent Rating Scale.)

4. Projective techniques, such as producing drawings or storytelling in response to a specific picture, are also used. These techniques are severely limited for preschool children. Most require a fair degree of verbal skill that the child does not yet possess. For example, in producing a drawing the child is asked to tell what the drawing represents, what feelings are associated with the drawing, etc. Often the drawings and other productions of young children are either very limited or not easily interpretable.

5. Traditional tests that have been normed on children this age. The literature seems to be dominated by four such tests: the Stanford-Binet, the WPPSI, the McCarthy Scales of Children's Abilities, and the Kaufman Assessment Battery for Children.

Available methods. From a somewhat different and perhaps broader point of view, we can consider four methods available to assess preschool children:

1. individual tests, such as the Stanford-Binet;
2. multidimensional batteries; a wide variety of measures exist in this category, many covering such domains as fine and gross motor movement, language, cognition, self-help, and personal-social-emotional aspects;
3. adaptive skill assessment measures, which focus on specific skills; and
4. adaptive process measures; these involve the assessment of complex competencies (e.g., eye contact) that can simultaneously involve social, adaptive, and cognitive abilities. (For a review of 28 such measures, see Neisworth & Bagnato, 1986.)

Psychometric tests. Schakel (1986) argued that most standardized assessment tests normed on preschool-age children, such as the Stanford-Binet and the WPPSI, are based on a psychometric approach and are useful in making classification and placement decisions, but are of limited use in truly understanding a child's cognitive development. In addition to these scales based upon a "psychometric" approach, there are also the following:

1. Piagetian-based scales. The cognitive developmental theory of Piaget has served as a springboard for several scales (e.g., the Concept Assessment Kit; Goldschmidt & Bentler, 1968) that attempt to measure the child's cognitive level in accord with the various stages proposed by Piaget. Such scales, however, have been criticized (e.g., Dunst & Gallagher, 1983), and have not found widespread use.

2. Comprehensive Developmental Assessment Tools. These are typically checklists of items drawn from descriptions of normal child development. They usually cover several domains of development, including the cognitive domain. Some of these checklists involve standardized administration, while others involve informal administration by observing the child and/or interviewing the parent. One example of this type is the Brigance Inventory of Early Development (Brigance, 1978). Items that are failed on these tests typically are targeted for intervention, because the items usually reflect observable behaviors that represent specific skills.

3. Process-oriented assessment approaches. The main assumption of these approaches is that the identification of cognitive strategies is necessary to understand cognitive performance; i.e., what is important is not the answer but how one arrives at the answer. Thus, testing is not seen as static, but as part of the test-teach-test sequence. Many of these techniques involve allowing the child to learn from the testing experience.

Equivalence of instruments. In Chapter 4, we discussed a number of instruments, such as the Stanford-Binet, the Wechsler, and the K-ABC, that can be used with this age range. From a psychometric point of view, one concern is the equivalence of such instruments. Note that part of the answer is provided by the correlation coefficient – that is, do scores on one test correlate with scores on the other test? But this is only part of the answer. Scores on the two tests could correlate substantially, yet one test might produce consistently higher IQs than the other test. If the IQ score is then used to make practical decisions, such as placing a child in a special program, use of different tests would produce different decisions.

Two of the tests discussed in Chapter 4 are the Stanford-Binet and the K-ABC, both quite useful with preschool children. In one study, the Stanford-Binet IV and the K-ABC were administered to 36 preschool children, aged 3 to 5 (Hendershott, Searight, Hatfield, et al., 1990). The authors obtained no significant differences between the overall mean composite scores on the two tests, and scores on most dimensions across the two tests were moderately to highly intercorrelated. Thus, the two tests seem fairly equivalent.

Gerken and Hodapp (1992) compared the Stanford-Binet L-M with the WPPSI-R in a group of 16 preschoolers, all of whom had been referred for assessment as to whether they were eligible for special-educational services. The children ranged in age from 3 to 6. The average S-B IQ for these children was 77.93, vs. a WPPSI-R mean IQ of 75.62, with 10 of the 16 children obtaining higher IQs on the S-B than on the WPPSI-R. However, scores on the two tests correlated .82. Equivalence, then, is a function not only of the test forms used but also of the nature of the child tested.

Lowered reliability. One of the general findings is that the reliability of tests administered to young children is often quite low, even though the same instrument with older children will achieve quite respectable levels of reliability. Paget and Nagle (1986) indicate that the lowered reliability should not be taken as evidence of psychometric weakness, but rather as reflective of the rapid developmental changes characteristic of this age group.
Preschool children comprise a "unique" population, and not simply a younger version of school-aged children. There is wide variability in their experiential background in terms of exposure to adults, preschool environments, peers, demands for responsible behavior, and so on.

Test requirements. As we discussed in Chapter 12, certain children present special challenges, and a test that may be quite useful with normal children may be of limited use with special populations. Bagnato (1984) indicates that the major dilemma when assessing preschoolers who are handicapped is finding scales that are technically adequate, yet appropriate for the child's disabilities, practical for planning interventions, and sensitive for monitoring developmental progress.

Assessing social and emotional functioning. This area of measurement is a relatively new one, and dates to the 1960s, when compensatory-education programs such as Head Start were begun. These programs were based on the premise that children who presumably were not ready for first grade because of their disadvantaged background must receive preschool experiences so that they would become "school ready." Thus the assessment of school readiness, defined not only in cognitive terms but also in personal and social terms, such as self-confidence, self-discipline, and positive attitudes toward others, became a set of expectations against which to compare the behavior of the school child (R. P. Martin, 1986).

Preparing for testing. Romero (1992) suggests a number of preparatory steps to be taken before testing a preschool child. First, the examiner needs to have a clear and specific referral question. Often the referral is made in very general terms – for example, "assess this child for educational placement" – and so the referral source needs to be interviewed to determine what information is wanted and how findings and recommendations can be related to remedial action or placement decisions. A second step is to study the available data, determining, for example, whether the child has certain medical conditions, such as vision problems, that may alter the testing procedure. A third step is to identify all the relevant persons who need to be involved in the assessment, such as parents, school nurse, pediatrician, teachers, and so on. The fourth step is to determine which areas of assessment need to be emphasized. Testing often covers such areas as cognitive functioning, social-emotional behavior, motor coordination, adaptive competencies, and language – but in specific cases some of these may be of paramount importance. Finally, an overall testing strategy is developed. The examiner decides which specific tests will be used, in what order they will be administered, and what specific adaptations will need to be made. (See Nuttall, Romero, & Kalesnik, 1992, for an excellent overview on assessing and screening preschoolers.)

ASSESSMENT IN THE PRIMARY GRADES

The various concerns we have discussed typically continue to be relevant as the child advances into the primary grades. However, a new focus appears, and that is how much the child achieves in school. Thus, achievement test batteries become important. A typical example is the California Achievement Tests (CAT) battery.

The California Achievement Tests (CAT)

The CAT is one of several nationally standardized, broad-spectrum achievement test batteries designed to assess the basic skills taught in elementary and secondary schools. The CAT measures basic skills in reading, language, spelling, mathematics, study skills, science, and social studies, and is designed for use in kindergarten through the twelfth grade. This test battery was first used in 1943 and has undergone a number of revisions, with the fifth edition published in 1992. There are two major uses for the CAT: first, to determine which specific skills students have or have not mastered, and second, to compare students' performance with that of a national sample.

Description. The items in the CAT are multiple-choice items, most with four response choices. They range in difficulty, with an average difficulty level of about 50%. The items are printed in test booklets, and, beginning with grade 4, there are separate answer sheets.
At the primary-grade levels the test administrator reads both the directions and the items aloud to the students. At the upper-elementary and secondary-grade levels, only the directions are read aloud. This is a group-administered test battery designed to be administered by classroom teachers, with the help of proctors.

There are three formats or combinations to the CAT: the Basic Skills Battery, the Complete Battery, and the Survey Tests. These formats differ in the number of items per subtest and in the kind of scores that are reported. The Basic Skills Battery provides both norm-referenced and curriculum-referenced results for reading, spelling, language, mathematics, and study skills. The Complete Battery has two parallel forms and covers the same areas as the Basic Skills Battery, plus science and social studies. Subtests in both the Basic Skills Battery and the Complete Battery contain 24 to 50 items. The Survey Tests provide norm-referenced scores for the same areas as the Complete Battery, but each subtest is composed of fewer items; in other words, the Survey Tests can be considered a "short" form of the Complete Battery. The drawback is that the Survey Tests do not provide curriculum-referenced scores, and, because they are shorter, the standard error of measurement is larger.
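The relationship between test length and measurement precision can be made concrete. The following sketch is purely illustrative; the reliabilities and the scale SD are invented, not taken from the CAT manuals. It uses the standard formula SEM = SD × √(1 − r):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical subtest with SD = 50 scale points: a longer Complete
# Battery version (r = .90) vs. a shorter Survey version (r = .80).
print(round(sem(50, 0.90), 1))  # 15.8
print(round(sem(50, 0.80), 1))  # 22.4 -- fewer items, larger SEM
```

Because the shorter test is less reliable, an observed score on it is a noisier estimate of the pupil's true standing, which is exactly the drawback noted above.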
Each subtest has time limits, although the CAT is not intended to be a speed test. Subtest time limits vary from 14 to 50 minutes, and administration of the Complete Battery takes anywhere from 1½ hours to more than 5 hours, depending on the level used.

Locator tests. Within a particular classroom, there may be a fairly wide range of achievement. If the exact same test were administered to all pupils, the brightest might well be bored and not challenged, while the less competent might well be discouraged. To minimize this, the CAT uses "locator tests" that consist of 20 multiple-choice vocabulary items and 20 multiple-choice mathematics items; the child's performance can be used as a guideline on which level of the CAT to administer.
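The routing logic behind a locator test is simple to express in code. The sketch below is hypothetical: the cutoff scores and level labels are invented for illustration and are not the CAT's actual routing rules:

```python
def recommend_level(vocabulary_correct: int, mathematics_correct: int) -> str:
    """Suggest a test level from locator-test performance.

    Each locator section has 20 multiple-choice items; the cutoffs
    below are hypothetical, not published ones.
    """
    total = vocabulary_correct + mathematics_correct  # 0 to 40
    if total <= 13:
        return "lower level"
    if total <= 27:
        return "on-grade level"
    return "higher level"

print(recommend_level(vocabulary_correct=18, mathematics_correct=16))  # higher level
```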
Some special features. The CAT has been praised for a variety of aspects that reflect a highly professional product. For example, the test booklets are easy to read and use, and the type and graphics are legible and attractive (Carney & Schattgen, 1994). The test manuals, including the directions for administration, are very clearly written and provide comprehensive directions. There are practice tests that give students experience in taking a standardized test. Braille and large-type editions are available for visually impaired examinees.

The tests at different school levels are "linked" together, both statistically and theoretically as well as by actual items, so that continuity is assured. Thus a child can be tested in the fifth grade and retested in the seventh grade, with assurance that the test scores are comparable, rather than reflecting two different tests.

There is a vast amount of material available on the CAT, including a guidebook for the classroom teacher, a guide for test directors, comprehensive technical summaries, and test reviews in sources such as the Mental Measurements Yearbook.

Scoring. Booklets for kindergarten through grade 3, and answer sheets for grades 4 through 12, are available in both hand-scorable and computer-scorable formats. The publisher offers computer scoring services with a number of reporting options that provide scores for the individual student as well as for specific units, such as classrooms and grades. The individual student reports contain clear and comprehensive explanations of the results, so that teachers and parents can readily understand the test results. Test scores are reported on a scale that ranges from 0 to 999 and, because of the way it was developed using item-response theory, is actually an equal-interval scale.

Interrelationship of subtests. Do the CAT subtests measure different domains? In fact, the subtests intercorrelate substantially with each other, with coefficients in the .50 to .80 range. There is thus substantial overlap, suggesting that perhaps the test battery is really assessing a general construct (g?) (Airasian, 1989). In fact, scores on the CAT tend to correlate substantially with scores on tests of cognitive abilities, with coefficients in the .60 to .80 range. This could be a troublesome aspect, particularly because the newest edition of the CAT is said to measure more general understanding, skills, and processes, rather than factual content; it is not clear, at a conceptual level, whether we have an achievement or an aptitude test.
Reliability. In general, the reliability estimates of the CAT seem satisfactory (Carney & Schattgen, 1994). Much of the focus is on internal consistency, with K-R 20 reliability coefficients ranging from about .65 to .95; most of the coefficients are in the .80s and .90s. The lower reliabilities occur at the younger grades and with the shorter subtests. Alternate-form reliability coefficients are in the .75 to .85 range, and test-retest reliability coefficients are in the .80 to .95 range.
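K-R 20 itself is easy to compute from item-level data. The sketch below uses a tiny invented data matrix, not CAT data; the formula is KR-20 = [k/(k − 1)] × [1 − Σpq/σ²], where p and q are the proportions passing and failing each item and σ² is the variance of the total scores:

```python
def kr20(scores):
    """K-R 20 for dichotomously scored items.

    scores: one row per examinee, one 0/1 entry per item.
    """
    n, k = len(scores), len(scores[0])
    p = [sum(row[j] for row in scores) / n for j in range(k)]  # item difficulties
    sum_pq = sum(pj * (1 - pj) for pj in p)
    totals = [sum(row) for row in scores]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - sum_pq / variance)

# Invented example: four examinees answering three items
print(kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))  # 0.75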
Content validity. As mentioned in Chapter 3, content validity is of major importance to educational tests, and so it is not surprising that authors of achievement test batteries place great emphasis on content validity, often at the expense of other types of validity.

The content validity of the CAT was built into the test from the beginning, as it should be. Educational objectives to be measured were specified by reviewing curriculum guides, instructional programs, textbooks, and other relevant materials. Individual items were written by professional item writers, with vocabulary difficulty and readability closely monitored. Both teachers and curriculum experts reviewed the items, and special steps were taken to ensure that gender, ethnic, or racial bias was avoided. For the fifth revision, new items were administered to and cross-validated with representative samples of students, and the results were analyzed using item-response theory.

Unfortunately, there is less information available on other types of validity. For example, the test manuals contain evidence that mastery of a particular content area, as assessed by the CAT, increases within a grade level from Fall to Spring, and increases across grade levels.

Norms. Both Fall and Spring norms are available, so that, depending on the time of test administration, a better comparison can be made. The norms are based on sizable samples of pupils at each level (typically around 10,000), selected on the basis of a stratified random-sampling procedure, with minority groups well represented, with Catholic and private schools included, and with an overall normative sample in excess of 300,000 students.

The Lake Wobegon effect. One of the reasons why achievement test batteries are revised frequently is the need for current norms. Garrison Keillor, a humorist with a popular radio program, talks about a community where "all the women are strong, all the men are good looking, and all the children are above average." This "Lake Wobegon effect" was applied to the results of national achievement tests when most school districts using a test were reporting above-average results. The reason for this, in part, has to do with the recency of available norms. Because children are learning more, a comparison of their performance against "older" norms will yield a more positive result than a comparison against more recent norms (Linn, Graue, & Sanders, 1990). Unfortunately, the results of a test battery like the CAT often take on a life of their own and are misused as a yardstick against which to measure the performance of teachers and the whole community. (As I write this, the newspapers report a national scandal in which teachers altered the answers of their pupils to obtain higher test scores.)

Overall evaluation. Carney and Schattgen (1994) note that the development of the CAT was carried out in a very thorough and professional manner, and that efforts to achieve content validity were "first-rate." The CAT is also praised for its clarity and ease of use, and for the variety of score reports available, so that a particular school system can choose what best fits its needs.

The CAT is generally seen as a well-constructed, state-of-the-art battery that compares very favorably to other achievement-test batteries such as the Iowa Tests of Basic Skills, the Metropolitan Achievement Tests, and the Stanford Achievement Tests. However, it is faulted for not providing complete information on test-retest reliability and construct validity (e.g., Wardrop, 1989).

Teacher Rating Scales

These rating scales of childhood behavior problems are used widely by school psychologists.
They are easy to administer and score, and can provide a summary evaluation of the child's behavior in the setting where children spend most of their time – the classroom. These scales can be quite useful for direct measurement and can provide a guide for later interviews with the teacher or parent, or for direct observation. These scales can also be used to evaluate treatment programs, such as the effect of medications or of behavioral interventions. Many teacher-rating scales have been developed, often with a specific focus that differs from scale to scale.

Neeper and Lahey (1984) undertook a study to determine what the common dimensions of teacher-rating scales might be, and whether these dimensions might be related to well-established factors of maladaptive behavior. They developed a 60-item teacher-rating scale designed to reflect a broad range of childhood behavior problems and cognitive deficits. They asked 26 teachers to rate a total of 649 children in the second through fifth grades. A factor analysis indicated five meaningful factors:

I. A Conduct disorders factor accounted for 60.9% of the common variance and was defined by such items as "disrespectful to teacher" and "fights with other children."
II. An Inattentive-perceptual factor accounted for 15.2% of the common variance and was defined by such items as "starts to work before making sure of directions" and "confuses visually similar words or numbers."
III. Anxiety-depression accounted for 9% of the common variance and was defined by such items as "appears tense or nervous" and "seems depressed."
IV. Language processing accounted for 7.9% of the common variance and was defined by such items as "does not speak clearly and fluently" and "doesn't seem to think in a coherent, logical fashion."
V. Social competence accounted for 7.2% of the common variance and was defined by such items as "is able to join in ongoing group activities easily" and "cooperates actively with other children in a group."

These five factors were used to develop five corresponding scales. A subsample of 45 children was re-rated 2 weeks later and test-retest correlation coefficients were computed; these ranged from a low of .69 to a high of .89, indicating adequate reliability. A correlational analysis indicated that some of the scales correlated significantly with each other, with correlation coefficients ranging from a low of −.13 (between factors 4 and 5) to a high of .64 (between factors 1 and 2). One of Neeper and Lahey's conclusions was that currently published teacher-rating scales are not sufficiently comprehensive.

HIGH SCHOOL

Social Competence. One of the critical aspects of adolescence is the development of social competence. This is a complex construct that probably involves such aspects as achieving age-appropriate goals and having good social skills.

Cavell and Kelley (1994) developed a self-report measure of social competence to identify adolescents experiencing significant interpersonal difficulties. They first asked a large sample of 7th, 9th, and 11th graders to answer an open-ended questionnaire and describe situations that "did not go well" in a variety of areas. They obtained a total of 4,005 such problem descriptions, which were then sorted into single categories, with redundant items eliminated. This yielded a pool of 157 discrete problem situations (e.g., Friend ignores you; Sibling refuses to let you borrow something). Adolescents were then asked to rate each situation on 5-point Likert response scales as to how often the situation had occurred (frequency) and how difficult it had been to deal with the situation (difficulty). A factor analysis yielded seven factors that comprised 75 items. These factors were labeled: (1) Keep Friends (e.g., Friend tells others your secrets); (2) Problem Behavior (e.g., You want to drink alcohol, but your parents object); (3) Siblings (e.g., Sibling embarrasses you in front of your friends); (4) School (e.g., Teacher is mean to everyone including you); (5) Parents (e.g., Parents are too nosy); (6) Work (e.g., You dislike your job and your boss, but you need the money); (7) Make Friends (e.g., Peers don't like you because of your appearance).

Each of the seven scales could then be scored for frequency and for difficulty. Internal consistency, as indexed by coefficient alpha, for the seven scales seems substantial. For the frequency scores, alphas ranged from .79 to .90 (median α = .86), and for the difficulty scales they ranged from .87 to .90 (median α = .89).
These coefficients are, as the authors indicate, somewhat inflated because they were computed on the same sample used for the factor analysis. The individual scales were intercorrelated with one another, some to a substantial degree. For example, Keep Friends and Make Friends correlated .68; Siblings and Work, on the other hand, correlated .16.
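Coefficient alpha, reported above for the frequency and difficulty scores, generalizes K-R 20 to items that are not scored 0/1, such as 5-point Likert ratings: α = [k/(k − 1)] × [1 − Σσᵢ²/σ²total]. A minimal sketch with invented ratings:

```python
def cronbach_alpha(items):
    """items: one list of scores per item, all the same length
    (one score per respondent)."""
    k, n = len(items), len(items[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_vars = sum(variance(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum_item_vars / variance(totals))

# Invented example: three 5-point Likert items rated by four adolescents
print(round(cronbach_alpha([[5, 4, 2, 1], [4, 4, 3, 1], [5, 3, 2, 2]]), 2))  # 0.93
```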
The authors report a number of interesting findings. For example, situations involving parents and siblings were the most frequently occurring, whereas situations involving problem behavior and work were relatively infrequent events. Situations involving parents and current friends were seen as the most difficult, and problem behavior and work as least difficult. Two of the scales, Problem Behavior and Work, showed significant gender differences as to frequency, with male adolescents rating these situations as more common. Three scales – Parents, School, and Keep Friends – showed significant gender differences as to difficulty, with females rating these problem situations as more difficult.

In a second study, the authors revised the scoring procedure slightly and combined the frequency and difficulty ratings into one score, and found that this seemed to be a more accurate assessment of adolescents' social performance. In a third study, they assessed the concurrent validity of their scale by comparing scale scores with peer nominations of most liked and least liked, with teachers' ratings of peer acceptance, and with a standardized measure of parent-adolescent conflict.

In this third study, reliability was also assessed. Internal consistency yielded a median α of .81, but for the Problem Behavior scale α was .58. Test-retest reliability over a 2-week period ranged from .72 for Work to .86 for School, with a median coefficient of .78. Adolescents who were seen by peers and teachers as popular were compared with those seen as unpopular. Unpopular adolescents scored significantly higher on Parents, School, and Make Friends, and generally endorsed more overall problem situations. Similarly, adolescents who scored higher on the measure of parent-adolescent conflict scored higher on five of the seven scales. These findings support the construct validity of this questionnaire.

Tests of General Educational Development (GED Tests)

The GED tests, developed by the American Council on Education, are used throughout the United States and Canada to award high-school-level equivalency credentials to adults who did not graduate from high school. Specific score requirements are set by each state or Canadian province. More than 700,000 adults take the GED tests each year, and nearly 470,000 are awarded a high-school equivalency diploma (Whitney, Malizio, & Patience, 1986).

Description. The GED battery contains five tests: (1) writing skills; (2) social studies; (3) science; (4) reading skills; and (5) mathematics. Because of the nature of this battery, new forms of each test are continually produced, so questions about reliability and validity can either focus on a specific form or, more usefully, on the test itself regardless of form.

Reliability. Whitney, Malizio, and Patience (1986) present K-R 20 coefficients for different forms used in 1980, for both large samples of U.S. graduating high-school seniors and for samples of GED examinees. The results are summarized in Table 13.1. These coefficients reflect a high degree of reliability for the various forms and for both samples. As you can see, the K-R 20 coefficients are slightly lower for the GED examinees, reflecting the lesser variability in their scores.

Table 13–1. Range of K-R 20 Coefficients for the GED Tests (Whitney, Malizio, & Patience, 1986)

GED test          High-school seniors    GED examinees
Writing skills    .93 to .94             .88 to .94
Social studies    .91 to .93             .86 to .94
Science           .90 to .93             .86 to .93
Reading skills    .89 to .92             .85 to .93
Mathematics       .90 to .93             .81 to .92

During the 1980 standardization study, samples of high-school seniors took two different forms of the GED tests. This allowed the computation of alternate-forms reliability; the obtained coefficients ranged from .76 to .89, with the majority of the coefficients in the low to mid .80s, again suggesting substantial reliability. These coefficients are somewhat lower than the K-R 20 coefficients because different forms of the tests were used and were administered on different days, thus introducing two additional sources of error variation.
Content validity. Because the GED tests are intended to measure "the major and lasting outcomes of a high school program of study," the authors argue that content validity is of greatest importance. You will recall that content validation is carried out primarily by logical analyses of test items, and that ordinarily it is built into the test rather than analyzed afterwards. Items for the GED tests are written by teams of experienced educators, and test content is carefully analyzed by other teams of curriculum specialists and related professionals.

Concurrent validity. As indicated above, the GED tests are also administered to national samples of graduating high-school seniors, even though the tests are intended for adults who have left the high-school environment without obtaining the degree. One reason for doing this is to ascertain that the decision made on the basis of the GED test scores is equivalent to the decision made by high schools, that is, that the test scores truly reflect a high-school-equivalent performance. Considering the typical GED scores required by most states, somewhere between 27% and 33% of currently graduating high-school seniors would be considered to have failed the GED tests. Thus, the standards used in the GED testing program are somewhat more stringent than those employed by high schools.

Another aspect of concurrent validity concerns the correlations between tests. The GED tests do correlate substantially with each other: correlation coefficients range from .63 for Mathematics vs. Writing Skills to .82 for Science vs. Social Studies. All of the tests require that the examinee read and interpret written material, and four of the tests involve the use of written passages followed by a series of questions. Therefore, it is not surprising that the tests correlate significantly with each other. We would hope that these correlation coefficients are somewhat lower than the parallel-forms reliability – i.e., the Writing Skills test should correlate more with itself (different form) than with another GED test, and that seems to be the case.

There is a substantial body of literature available, although most of it is in American Council on Education publications, indicating that correlations between GED tests and tests in other batteries designed to assess similar or identical variables are quite substantial and generally support the validity of this battery.

Predictive validity. Predictive validity is somewhat more difficult to document in this case because the test battery is not designed to measure a specific variable for which specific predictions can be made, but rather assesses the equivalency of a broad educational procedure, i.e., high school. One aspect of predictive validity can be found in nationwide follow-up studies of GED graduates, who indicate that passing the GED tests led to improvements in pay, acceptance into training programs, and other benefits. Of course, such data is somewhat suspect because we don't know to what degree the responses were elicited by the questions asked, nor do we know what happened to individuals who did not take the GED tests.

The National Assessment of Educational Progress (NAEP)

The NAEP is a Congressionally mandated survey of American students' educational achievement; it was first conducted in 1969, annually through 1980, and biennially since then. The goal of the NAEP is to estimate educational achievement, and changes in that achievement over time, for American students of specific ages, gender, and demographic characteristics (E. G. Johnson, 1992).

The items used by the NAEP are similar to those in teacher-made classroom tests and standardized achievement tests. However, such tests are designed to measure the proficiencies of an individual. The NAEP is designed to measure the distribution of proficiencies in student populations. Thus, not every student is tested, nor are the tested students presented with the same items.

The NAEP covers a wide range of school-subject areas such as reading, mathematics, writing, science, social studies, music, and computer competence. Students are tested at ages 9, 13, and 17, corresponding roughly to grades 4, 8, and 12.
Some subject areas, such as reading and mathematics, are assessed every 2 years, while other areas are assessed every 4 or 6 years. For sample items and an overview of the development of the NAEP, see Mullis (1992).

The items or exercises for each content area were developed using a consensus approach, where large committees of experts, including concerned citizens, specified the objectives for each content area. The test results are then analyzed using item-response theory.

The NAEP was conceived as an information system that would yield indicators of educational progress, just as the Consumer Price Index is one indicator of economic health (R. L. Linn & Dunbar, 1992). Rather than report global test scores, analyses of the NAEP are based on the individual items or exercises; thus the basic scoring unit is the percentage of test takers who successfully complete a particular exercise.

Essay vs. Multiple Choice

Tests composed of multiple-choice items are quite often vilified with the argument that they merely test factual knowledge rather than the ability to think, to produce arguments, to organize factual material, and so on. From a psychometric and a practical point of view, multiple-choice items are preferable because they are easy to score by machine, do not involve the measurement error created by subjective scoring, and are more reliable and more amenable to statistical analyses. In addition, well-written multiple-choice items can indeed assess the more complicated and desirable aspects of cognitive functioning.

Bridgeman and Lewis (1994) addressed this issue by looking at Advanced Placement (AP) examinations in the fields of American History, European History, English, and Biology, which contain both multiple-choice and essay sections. The AP examinations are taken by high-school students who are seeking college credit or placement into advanced college courses. Thus, the content validity of these exams is of importance.

Bridgeman and Lewis (1994) tabulated the AP scores for a nationwide sample of more than 7,000 students from 32 colleges and compared these to GPA obtained in the same topic-area courses. It is interesting to note the differences in reliability between the essay and multiple-choice sections of these exams. For example, the K-R 20 reliability for the multiple-choice section of the American History exam was .90 and .89 for two yearly samples, vs. a coefficient alpha of .54 for the essay section. The correlations between the multiple-choice and essay sections were .48 and .53; similar findings were obtained on all other AP exams. In other words, scores on the essay sections are not reliable and do not correlate highly with the scores on the multiple-choice sections.

What about correlations with GPA? Multiple-choice scores from the American History and Biology examinations were more highly correlated with freshman GPA than were essay scores. For the European History and English examinations, the differences between correlation coefficients from multiple-choice sections and essay sections were not significant, and a composite of the two sections was a better predictor of GPA than either section by itself.

ADMISSION INTO COLLEGE

The Scholastic Aptitude Test (SAT)

Historical note. In 1900, 12 colleges and universities joined the College Entrance Examination Board, an organization created by an association of colleges to bring some order into the chaotic world of admission examinations, which up to that time varied substantially from institution to institution (M. R. Linn, 1993). The first examinations prepared by this Board were of the essay type, but in 1926 the Board presented the Scholastic Aptitude Test (SAT). Some 8,040 candidates were tested in that year, versus the 1 million plus who are tested currently. At first, only a total score was provided, but subsequent analyses indicated that verbal and mathematical scores did not correlate highly, and so the two scores were kept separate.

The SAT was actually intended to fight discrimination in the college admission process. At that time, selective colleges such as the Ivy League schools had relationships with specific college-preparatory academies, so that students from those schools could be admitted readily without too much regard for their academic credentials, but with emphasis on their social and ethnic characteristics. By introducing a test such as the SAT into the admission process, equal opportunity was presented to all.
From the early years, the standard-score system with a mean of 500 and SD of 100 was used. However, this did not permit comparison of one year's candidates with those of the following year: each test form contained different items, but the mean was always equated to 500. It was therefore decided to use the candidates who took the April 1941 test as the normative sample. This was made possible because each subsequent form does contain some items identical to those on the prior form, and thus, statistically, it is possible to equate the scores on such forms.
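The 500/100 scale is a linear transformation of z scores; the equating problem arises because the mean and SD must come from a fixed reference group (the April 1941 candidates) rather than from each year's examinees. A sketch of the basic conversion, with invented reference-group statistics:

```python
def to_scaled_score(raw: float, reference_mean: float, reference_sd: float) -> float:
    """Convert a raw score to the 500/100 scale using the mean and SD of a
    fixed reference group, not of the current year's test takers."""
    z = (raw - reference_mean) / reference_sd
    return 500 + 100 * z

# Hypothetical form: the reference group averaged 40 raw points (SD = 12)
print(to_scaled_score(raw=52, reference_mean=40, reference_sd=12))  # 600.0
```

Because the reference group is fixed, a later cohort that performs better than the 1941 candidates will average above 500, and one that performs worse will average below it.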
Prior to 1958, test scores on the SAT were not given to the students, and so it was quite easy for admission officers to indicate that a student had been rejected because of low SAT scores, whereas in fact the rejection might be based on marginal high-school grades or other aspects.

In terms of college admissions, the Scholastic Aptitude Test clearly dominates the field, followed by the Academic Tests of the American College Testing Program, known as the ACT.

Description. The content of the SAT has changed very little since its inception. From the beginning there have been two major content areas: verbal and quantitative. The current SAT Verbal section consists of four item types: antonyms, analogies, sentence completion, and reading comprehension. The antonyms have been used since 1926 and the other three since the mid-1940s (Bejar, Embretson, & Mayer, 1987).

The SAT quantitative (or SAT-M) appears to be unidimensional, so there is little empirical justification for dividing the SAT-M score into subscores. The SAT-V, however, seems to be composed of two distinct but highly related dimensions: a reading dimension and a vocabulary dimension (Dorans & Lawrence, 1987).

Each of the 85 verbal and 60 mathematical items is a five-option multiple-choice task, scored by a formula intended to offset any gain in score that might be expected from blind guessing (R. M. Kaplan, 1982).
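The usual correction-for-guessing penalizes each wrong answer by 1/(k − 1), where k is the number of options, so a blind guesser expects to gain nothing; for five-option items this is the familiar "rights minus one-fourth of wrongs" rule. A sketch (the scoring details of any particular SAT form may differ):

```python
def formula_score(right: int, wrong: int, options: int = 5) -> float:
    """Rights minus wrongs/(options - 1); omitted items neither help nor hurt."""
    return right - wrong / (options - 1)

# Blind guessing on 40 five-option items yields, on average,
# 8 right and 32 wrong -- an expected formula score of zero:
print(formula_score(right=8, wrong=32))  # 0.0
```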
The SAT can be described as an outcome-oriented test, as opposed to a process-oriented test. In an outcome-oriented test, what matters is the total score that reflects the number of correct answers. With a process-oriented test, the focus is on both the correct and incorrect answers, and the emphasis is on diagnostic utility, i.e., the pattern of responses is related to a diagnosis. Thus, the SAT is not intended to be a diagnostic tool.

Revisions. In one sense, the SAT is continually being revised because new forms are generated for every test administration. Revision in the sense of major changes does not occur too frequently. The SAT was revised in 1994; these revisions include longer reading-comprehension passages with questions that ask students to focus more on the context of the reading, math questions that require students to generate their own answers, and more time per question on the test, so there is less time pressure.

A number of changes were also made on the Achievement Tests, such as the inclusion of a 20-minute essay in the English Achievement Test. These achievement tests are now called SAT IIs, to emphasize that they supplement the SAT, now called the SAT I.

Part of the 1995 revision involved "recentering" the scores. Although the mean for the SAT is theoretically supposed to be 500, for 1993 college-bound seniors the verbal mean was 424 and the math mean was 478. Recentering involves statistical calculations that increase the scores but do not change the percentile ranking – very much like adding X number of points to each score to make the average come out to be 500 (Educational Testing Service, 1994).

More recent revisions include essay portions and different scoring procedures.

Multiple-choice items. Perhaps more than any other test, the SAT represents the stereotypical use of multiple-choice items, which are praised by test experts but criticized by critics. Multiple-choice items are advantageous not only because they can be scored by machine and thus are relatively inexpensive, but because they permit a much wider sampling of the subject matter and the student's abilities. In the same amount of time, we can ask a student to answer four or five short essay questions on American History, or we can administer 100+ multiple-choice items. A test made up of multiple-choice items can be planned carefully by specifying what is to be covered in terms of content and of abilities. Thus, the content validity of multiple-choice tests is usually substantially higher than that of essay tests.
Finally, multiple-choice items can be studied systematically to determine which items work as designed.

Test sophistication. One of the concerns that test developers have is the degree of test sophistication, or test-taking familiarity, a person may have. We want a test score to reflect a person's standing on the variable tested, and not the degree of familiarity the subject has with multiple-choice items. This is a particular concern with tests such as the SAT where, on the one hand, sophisticated examinees should not "beat the test" and, on the other hand, naive examinees should not be penalized (D. E. Powers & Alderman, 1983).

Two approaches are usually taken to remove sophistication as an extraneous variable. The first is to make sure that the test items and directions are easy to read, not complicated, and do not have extraneous clues that can help test-wise examinees. The second approach is to make sure that all examinees are equally sophisticated by teaching all of them the precepts of good test-taking strategies (e.g., budget your time; don't spend too much time on any one item, etc.), and by providing practice examples of test items. In 1978, the College Board introduced a booklet called Taking the SAT, designed to familiarize students with the SAT and provide all candidates with the same basic information about the test. Before the booklet was made available to all students, prepublication copies were sent to a random sample of SAT candidates. A comparison was then made of the effects of the booklet on test scores. The results indicated that, although the booklet was useful, reading it had a minimal effect on subsequent test scores.

Gender gap. There is currently an average difference of about 59 points on the combined SAT score between men and women, in favor of men. This is somewhat peculiar because the outcome that the SAT is supposed to predict, college freshman-year GPA, is consistently slightly higher for women than for men. This gender gap is primarily made up of lower mean scores on the math section for women; since the 1960s male students have scored an average of 46 points higher than female students (College Entrance Examination Board, 1988). But since 1972 it is also reflected in lower SAT-V scores. As a rule, women score better on verbal tests than men. Therefore, the implication is that there may be something "unusual" about the SAT.

There is a substantial body of literature on gender differences in mathematical ability indicating that female students outperform male students on measures of mathematical ability at the elementary and middle-school levels, but male students outperform female students at the high-school and college levels (e.g., L. R. Aiken, 1987). There is also evidence that the gender difference on a variety of math tests has become smaller over time, so the persistent gender difference on the SAT is quite puzzling (Byrnes & Takahira, 1993; Hyde, Fennema, & Lamon, 1990).

These test differences do not appear to be related to such aspects as choice of a college major, different career interests, or different courses taken in high school. One suggested possibility is that the gender gap is due to the increasing number of women, particularly minority women, who are taking the test.

As a predictor of college success, the SAT underpredicts the performance of women (their predicted GPA is lower than their actual GPA) and overpredicts that of men (their predicted GPA is higher than their actual GPA; Clark & Grandy, 1984). Sheehan and Gray (1991) compared results on the SAT with results on a standardized algebra exam taken by entering freshmen and transfer students at American University, a private university in Washington, D.C. The mean combined SAT score for women students was 1,096 and for men students 1,132. Their mean college GPA, however, was 3.01 for women and 2.89 for men. Scores on the algebra test had a higher correlation with college GPA than did the SAT scores, but all the correlations were lower than .30. The results of this study indicated no gender gap on the algebra test, but a gender difference on the combined SAT and on the GPA. The authors felt that this gender difference was not due to the hypothesis that women choose less difficult majors, or that more women with fewer economic and intellectual advantages take the SAT, or that women have more difficulty with multiple-choice tests. They also concluded that the SAT is not a valid predictor of academic achievement, and that a better approach might be to use achievement tests.
Byrnes and Takahira (1993) suggested that the gender difference on the SAT reflects the fact that male students perform certain cognitive operations more effectively than female students. In an experimental study of high-school students, they tested and found support for the hypothesis that performance on SAT items was due to differences in cognitive skills, such as the ability to define the problem and avoid misleading alternatives.

Minority bias. The use of standardized tests in the selection of applicants for admission to college theoretically benefits both the institution and the individual. By identifying students who potentially will fail, the institution is less likely to waste its resources, and so is the individual. Both false positives (a selected individual who fails) and false negatives (a potentially successful student who is rejected) are of concern, however.
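These two errors come from crossing the selection decision (based on the predictor) with the eventual outcome (the criterion). A small sketch with invented cutoffs:

```python
def classify_outcome(sat: float, college_gpa: float,
                     admit_cutoff: float = 1000, success_gpa: float = 2.0) -> str:
    """Cross the admission decision with the outcome; cutoffs are hypothetical."""
    admitted = sat >= admit_cutoff
    succeeds = college_gpa >= success_gpa
    if admitted and succeeds:
        return "true positive: admitted and succeeds"
    if admitted:
        return "false positive: admitted but fails"
    if succeeds:
        return "false negative: rejected but would have succeeded"
    return "true negative: rejected and would have failed"

print(classify_outcome(sat=950, college_gpa=2.8))
# false negative: rejected but would have succeeded
```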
Thorndike (1971) argued that if members of a minority group tend, on the average, to score lower on the predictor (i.e., the SAT) than on the criterion (i.e., GPA) as compared with the majority, then there will be a relatively higher incidence of false negatives if the test is used as a selection device. R. D. Goldman and Widawski (1976) analyzed student scores at four universities and found that the use of the SAT in the selection of black and Mexican-American students shifted the number of errors from the false positive category to the false negative category.

A number of studies have shown that minority or disadvantaged students, when admitted to college, can do quite well academically, despite relatively low Scholastic Aptitude Test scores. These findings raise issues about the validity of such scores for minority students, and a number of studies have shown that high-school achievement is a better predictor of college achievement than are test scores. In general, high-school GPA correlates about .30 with college GPA, with test scores adding little to the prediction.

Houston (1980) studied a small sample (n = 61) of black students given "special" admission to a university. A comparison of those students who had graduated within eight semesters vs. those who had been dismissed for academic reasons indicated significant differences in high-school rank, college GPA, and SAT-M scores, but not on SAT-V. Although the SAT-V difference was in the "right" direction (371 vs. 339), the sample size was too small and the variation too large to obtain significance.

McCormack (1983) studied the issue of minority bias on the SAT by analyzing the SAT scores and scholastic records of students at a large state university in California. He first developed a regression equation to predict first-semester college GPA for white students only. The equation looked like this:

College GPA = .73555 × high-school GPA + .08050 × (SAT-Total/100) − .67447

Note that high-school GPA is a better predictor than the SAT-Total, as indicated by its larger weight. The last number in the equation is the intercept, simply a mathematical "correction" to make the two sides of the equation equal.

For each person in the sample, we can compute the expected GPA using this equation and compare the expected GPA to the actual obtained GPA. To the degree that the prediction is accurate, the two GPAs should be equal, i.e., the average error should be zero. If the equation systematically underpredicts, that is, if the predicted GPA is lower than the actual GPA, then the average error should be positive; conversely, overprediction should result in a negative error.
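McCormack's equation and the error analysis are easy to reproduce. In the sketch below the coefficients are McCormack's, but the student record is invented:

```python
def predicted_gpa(hs_gpa: float, sat_total: float) -> float:
    """McCormack's (1983) regression equation, developed on white students."""
    return 0.73555 * hs_gpa + 0.08050 * (sat_total / 100) - 0.67447

def prediction_error(actual_gpa: float, hs_gpa: float, sat_total: float) -> float:
    """Positive error = underprediction; negative error = overprediction."""
    return actual_gpa - predicted_gpa(hs_gpa, sat_total)

# Invented student: high-school GPA 3.2, SAT total 1000, actual college GPA 2.9
print(round(predicted_gpa(3.2, 1000), 2))          # 2.48
print(round(prediction_error(2.9, 3.2, 1000), 2))  # 0.42 -> underpredicted
```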
How well did the equation work for white students and for minority students? Table 13.2 gives the correlations between the predicted GPA and the obtained GPA for two cohorts of freshmen.

Table 13–2. Correlations between Predicted GPA and Obtained GPA (McCormack, 1983)

Group       Year 1    Year 2
White       .38       .42
Asian       .57       .54
Hispanic    .35       .52
Black       .36       .40
Indian      .40       .42

Note that the regression equation seems to work relatively well in both cohorts for all ethnic groups, including minority groups, and especially for Asian students. What about the SAT taken individually and high-school GPA taken individually? How did these variables correlate with actual GPA? Table 13.3 provides the answers.
Table 13–3. Correlations between SAT and High-School GPA with Actual College GPA (McCormack, 1983)

            SAT                  High-school GPA
Group       Year 1    Year 2     Year 1    Year 2
White       .22       .24        .35       .37
Asian       .26       .53        .56       .45
Hispanic    .16       .44        .34       .47
Black       .29       .39        .29       .37
Indian      .04       .23        .40       .32

Note that, with the possible exception of black students, high-school GPA is a better predictor than SAT scores (keep in mind that high-school GPA summarizes 4 years of behavior, whereas the SAT reflects a few hours). Note also that there is a fair amount of variability from one cohort to the other, but generally the results of the regression equation are better than either variable by itself. A statistical analysis did in fact indicate a small overprediction for minority groups, except for American Indians.

J. Fleming and Garcia (1998) studied black students attending predominantly black and predominantly white colleges. Although their findings are too complex to summarize here, they suggest that any racial differences in the predictive validity of the SAT may be more a function of adjustment problems than of inherent bias in the SAT.

The SAT and Mexican-Americans. Goldman and Richards (1974) analyzed SAT scores and academic performance, as defined by second-quarter college GPA, for a sample of Mexican-American and Anglo-American students attending a large California university. On the SAT-V, SAT-M, and GPA, the Anglo students scored higher. For the Mexican-Americans, SAT-V correlated .33 with GPA, and SAT-M correlated .12 (corresponding correlations for the Anglo group were .40 and .37). A regression equation developed on the Anglo group correlated .44 with GPA. When this regression equation was applied to the Mexican-American sample, there was an overprediction of GPA; the actual average GPA was 2.28, but the predicted GPA was 2.66. The entire study was replicated on a subsequent larger sample with similar results. The authors concluded that if the SAT is used to predict the grades of Mexican-American students, using a regression equation developed on Anglos, the result will be an overprediction of grades; that is, the students will do less well than predicted. If, however, the equation is based on Mexican-American norms, the predictive validity of the SAT will be similar for Mexican-American students and for Anglo students.

Utility or validity? One of the major investigators in the area of using aptitude and achievement tests in college admissions has been James Crouse, who has recommended that colleges abandon the SAT and use standardized achievement tests instead to select incoming students (Gottfredson & Crouse, 1986).

Part of Crouse's argument is that the focus should be on the utility of a test rather than on its validity. Even if a test is unbiased and predicts desired criteria well, it does not necessarily mean that it should be used. Crouse argues that the practical benefits of the SAT for college admissions are minimal because the SAT provides predictions of success that are largely redundant with those made from high-school grades alone. SAT scores and high-school rank are moderately correlated (in the .40 to .50 range) with each other and with educational outcomes such as college GPA, so that outcomes predicted from high-school rank alone have a part-whole correlation of at least .80 with outcomes predicted from high-school rank plus SAT scores.

Crouse argues that achievement tests should be substituted for the SAT in that such tests are "no worse" than the SAT in predicting academic success. Their advantage is that they would promote "diligence" in high school. The SAT is seen as a measure of how smart a person is, presumably, in part, an innate characteristic. The achievement tests, however, would reflect how hard a person is willing to work (see Crouse, 1985, and a rebuttal by Hanford, 1985).

Aptitude vs. achievement. In addition to the SAT, some colleges, particularly the more selective ones, require applicants to present scores on the CEEB achievement tests. There are a number of these, and often the candidate has some degree of choice in which ones to take, so that not all candidates present the same achievement tests.
Aptitude vs. achievement. In addition to the SAT, some colleges, particularly the more selective ones, require applicants to present scores on the CEEB achievement tests. There are a number of these, and often the candidate has some degree of choice in which ones to take, so that not all candidates present the same achievement tests.

Schrader (1971) assessed the validity of these achievement tests and found that, by adding such test scores to the information already provided by high-school GPA and SAT scores, the prediction of college grades increased by about .05 for women and .03 for men. K. M. Wilson (1974) studied several liberal arts colleges for women. In these colleges, the combination of SAT-V + SAT-M correlated with first-year college grades from .13 to .53, with a median of about .26. By adding high-school rank, the correlations ranged from .23 to .59, with an average increase of about .12. Adding achievement-test scores (an average of whatever tests the student had taken), the correlations ranged from .28 to .60, with an average increase of about .07.

Using a slightly different statistical analysis, however, K. M. Wilson (1974) was able to show that once high-school rank and achievement-test scores were known, SAT scores did not improve the prediction of college GPA. In fact, he argued that the achievement-test overall average is a more valid predictor of college grades than the SAT. Baron and Norman (1992) studied close to 4,000 students who entered the University of Pennsylvania. In general, they found that both high-school class rank and average achievement-test scores added significantly to the overall prediction of cumulative GPA, but SAT scores did not.

Are aptitude/intelligence tests different from achievement tests? Kelley (1927) argued that they are not, because such measures correlate substantially with each other. The argument is still unresolved.

High-school achievement tests. Should high-school achievement test results be used to predict college GPA? Because these tests are routinely given in high school, and most are nationally normed tests such as the California Achievement Tests, we might omit the SAT altogether if we can show, as a first step, that high-school achievement test scores do indeed correlate with collegiate GPA. G. Halpin, G. Halpin, and Schaer (1981) studied more than 1,400 college freshmen who had taken either the SAT or the ACT, and while in high school had taken the California Achievement Tests. How did these tests correlate with freshman college GPA? Table 13.4 provides the answer.

Table 13–4. Correlations of Four Predictors with College GPA (Halpin et al., 1981)

Predictor                    Correlation
High-school GPA                  .49
SAT                              .42
Calif. Achievement Tests         .38
ACT                              .37

When high-school GPA was combined with each of the test scores individually, the correlation with college GPA went up to .53. The authors concluded that high-school GPA was a better predictor of college grades than either the ACT, the SAT, or the CAT. Combining high-school GPA with any of the test measures increased the predictive efficiency about 18.5%, with basically no differences between tests. Therefore, the authors concluded that the CAT could be used in lieu of the SAT or the ACT.
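The gain from combining two predictors can be checked with the standard formula for the multiple correlation of a criterion with two predictors. The intercorrelation between high-school GPA and the test scores is not given in Table 13.4, so the value used below (.45, the middle of the .40 to .50 range cited earlier for high-school rank and the SAT) is an assumption; with it, the formula comes very close to the combined correlation of .53 reported by Halpin et al.

```python
from math import sqrt

def multiple_r(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors,
    computed from the three pairwise correlations (standard formula)."""
    r_sq = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return sqrt(r_sq)

# Halpin et al. (1981): high-school GPA r = .49, SAT r = .42 with college GPA.
# The predictor intercorrelation of .45 is an assumed value.
print(round(multiple_r(0.49, 0.42, 0.45), 2))  # -> about 0.54, close to the reported .53
```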
Intelligence vs. aptitude. Feingold (1983) asked an interesting question: Are measures of intelligence, specifically the Information and Vocabulary subtests of the WAIS, better predictors of college achievement than are tests such as the SAT? He was able to locate four relevant studies and summarized the results, given in Table 13.5.

Table 13–5. Correlations with College GPA (Feingold, 1983)

            Achievement    WAIS           WAIS
            test           Information    Vocabulary
Study 1         .38            .48            .46
Study 2         .30            .43            .38
Study 3         .46            –              .45
Study 4         .25            .19            .25

Note: Different achievement tests were used in the different studies; study 3 used the SAT.

Note that in the first two studies, the WAIS subtests are better predictors of college GPA than are achievement-test scores. In one sense, this is not at all surprising. Academic achievement is a function of the broad intellectual capabilities one has, rather than a specific degree of knowledge.
Decline in SAT scores. Because the SAT is used, or misused, to somehow assess the state of our educational system, the observed decline in SAT scores from year to year has been the focus of much debate and concern. A substantial number of hypotheses have been advanced to account for the decline in SAT scores; Wharton (1977) listed 79. These hypotheses cover such reasons as inadequate teacher training, changes in family values, growing anti-intellectualism, food additives, and changing family patterns. Zajonc and Bargh (1980) postulated that the decline might be due to changes in family configuration, specifically birth order and family size, since the U.S. birthrate increased steadily from the late 1940s to the early 1960s; however, the data they collected did not support such a hypothesis. Although a definitive answer cannot be given, the observed decline seems to reflect democracy at work: Each year the students who apply to college are less elite and more heterogeneous.

Coaching. Can special preparation, i.e., coaching, have a significant impact on SAT test scores? A large number of companies provide coaching services to a substantial group of paying students each year. D. E. Powers (1993) indicates that a positive answer would have three major implications. First, if coaching is effective but not reasonably available to all test takers, then some test takers may have an unfair advantage. Second, if such short-term preparation that essentially emphasizes test-taking strategies is effective, then the validity of the test as an index of general academic ability is called into question. Third, because such special preparation through commercial services can be quite expensive, it may detract from students' participation in other worthwhile academic activities.

The term coaching may actually subsume three somewhat different types of activities. At the most superficial level, coaching means giving subjects a test-taking orientation, that is, making sure that the subject is familiar with the general procedures involved in taking a particular test. At a second level, we have the usual procedure, which is to practice on items that are similar to those in the test. At a third level, coaching involves teaching broadly applicable cognitive skills.

N. Cole (1982) suggested that coaching can affect the validity of a test in three ways: (1) Coaching could increase a person's score above their “true” level, thus invalidating the test; (2) Coaching could allow a person to perform at their best level rather than at their typical level; because such coaching would not be available to all, the validity of the test would suffer; and (3) Finally, if coaching affects performance on a test that supposedly measures stable traits, then again validity is compromised.

Coaching can cover a variety of procedures and goals. It can aim at increasing confidence or decreasing anxiety, or it can teach specific test-taking strategies or skills. It can involve short-term cramming or long-term instruction.

Coaching companies often suggest that their services can increase retest scores by a minimum number of points. Increases in test scores on retest, however, can be due to practice effects, real growth in abilities over the ensuing time period, or measurement error. Simply retaking the SAT improves test scores by about 15 points on the verbal portion and about 12 points on the math portion. Some very limited evidence suggests a yearly average improvement of about 50 points (D. E. Powers, 1993). Measurement error can increase or decrease scores, and Powers estimates that typically 1 in 25 SAT takers will gain 100 or more total points, and about 1 in 110 will lose 100 or more points, in retesting.

A substantial number of studies have looked at the question of coaching on the SAT and have come up with a variety of conclusions. Numerous study findings suggest negligible gains for students who do take such preparatory courses (e.g., Kulik, Bangert-Drowns, & Kulik, 1984; Messick & Jungeblut, 1981). These studies have also been analyzed through meta-analysis, and D. E. Powers (1993) summarizes the results of these meta-analyses. Overall, for the typical coaching program, the average increase in SAT test scores is about 15 to 25 points each on the verbal and the mathematics sections. More specifically, Powers indicates the following conclusions: (1) The effects of coaching are somewhat greater for the mathematics than for the verbal section; (2) Longer coaching programs yield somewhat greater effects than do shorter ones, but diminishing returns set in rather quickly, e.g., doubling the effort does not double the effect; (3) More rigorous studies, in which possible confounding variables are controlled, yield substantially smaller effects – estimated to be 9 points for the SAT-V and 19 points for the SAT-M; and (4) The average effect of coaching for a variety of other aptitude tests is estimated to be nearly three times the average effect for the SAT.
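The interplay of practice effects and measurement error on retest can be sketched with a toy simulation. The parameter values below (section SD of 100, reliability of .90, and the practice effects quoted above) are illustrative assumptions, and the resulting tail fractions will not reproduce Powers' 1-in-25 and 1-in-110 estimates, which come from actual retest data; the point is only that measurement error alone moves retest scores by a substantial amount.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Illustrative assumptions (not Powers' actual model): each section has
# SD = 100 and reliability about .90, giving a standard error of
# measurement of 100 * sqrt(1 - 0.90), roughly 32 points per sitting.
sem = 100 * np.sqrt(1 - 0.90)
practice = 15 + 12  # the verbal + math practice effects quoted above

# Simulated change in V+M total across two administrations:
# the practice effect plus fresh error on each section at each sitting.
first = rng.normal(0, sem, (n, 2)).sum(axis=1)
second = rng.normal(0, sem, (n, 2)).sum(axis=1)
change = practice + second - first

print(f"SD of retest change: {change.std():.0f} points")
print(f"share gaining 100+ : {(change >= 100).mean():.3f}")
print(f"share losing 100+  : {(change <= -100).mean():.3f}")
```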
A typical study is that by Alderman and P. E. Powers (1980), who found an average gain of 8 points due to coaching; or that of Smyth (1989), who found that high-school students who had taken some formal preparation for the SAT scored 6 points higher on the verbal section and 32 points higher on the math than their peers who had not taken such courses. However, the analyses showed that the scores of students who took the SAT a second or third time tend to improve significantly, and that coached students do tend to take the SAT more than once. In general, such coaching shows a negligible impact on verbal scores and a relatively small improvement on math (for an interesting review of some of the claims made by coaching companies, see Smyth, 1990).
The criterion: First-year GPA. The SAT predicts about 10% to 15% of the observed variance in first-year college grades (a validity coefficient of about .40 corresponds to .40² ≈ 16% of the variance), and so we might ask about the remaining 85% to 90%. Wainer (1993) suggests that part of the problem is that the criterion is neither well defined nor all that important – we ought to be more interested in predicting who will be a good engineer or a good social worker. One major point to keep in mind is that college grades are a very fallible criterion, and grading standards vary substantially across different majors and different institutions (e.g., R. D. Goldman & Slaughter, 1976).

Because the SAT was specifically designed to predict first-year college grades, most studies use that as the criterion. A smaller number of studies focus on total GPA. For example, J. French (1958), in a study of eight institutions, reported mean correlations of .43 and .27 between cumulative senior GPA and SAT-V and SAT-M scores. Hills, Bush, and Klock (1964) found a multiple correlation of .66 between cumulative senior GPA and a predictor composed of SAT-V, SAT-M, and high-school GPA. Mauger and Kolmodin (1975) reported correlations of .52 and .43 between “terminal” GPA and SAT-V and SAT-M scores in a sample of students where only 32% had graduated. In a sample of graduating seniors, where the range of grades was restricted, the correlations dropped to .26 and .22, respectively.

Reliability. Test-retest, internal consistency, and alternate-form reliability coefficients for the SAT range from the high .80s to the low .90s; the KR-20 reliability for the SAT-V is about .91 and for the SAT-M is about .92.

Validity. Most of the validity information is predictive validity, and most consists of correlations of SAT scores with first-year college GPA. These coefficients vary widely depending on a number of variables, such as major and institution, with coefficients ranging from the .10s to the mid .60s, but with most studies reporting a correlation near .40 between SAT scores and first-year college GPA.

Validity generalization. Traditionally, research on admission testing has emphasized the results of local validity studies, that is, using data from individual institutions. The assumption was made that validity differences from one study to another reflect the unique characteristics of different institutions and of the different applicants they attract.

The approach called validity generalization (see Chapter 3) has in fact shown that much of the variation in results from study to study is due to statistical artifacts, especially error from the use of small samples, and institutional differences in such things as how reliable the criterion is. Boldt (1986) studied three national samples of students who had taken the SAT, students who had applied to a particular college, and students who were admitted to a college. These samples were quite large – from 65,000 to almost 350,000. Boldt (1986) found that the internal reliability of the SAT ranged from .90 to .92 for both the verbal and the mathematics portions. The two sections correlated .68 with each other. He also concluded that the average validity of either SAT-V or SAT-M, when various sources of error were statistically controlled, was about .55.
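The statistical correction behind artifact-corrected estimates like Boldt's .55 is, in its simplest form, the standard correction for direct range restriction (Thorndike's Case 2). A minimal implementation, with illustrative numbers rather than Boldt's actual data:

```python
from math import sqrt

def correct_restriction(r: float, u: float) -> float:
    """Thorndike Case 2 correction for direct range restriction.
    r: correlation observed in the restricted (selected) sample.
    u: SD of the predictor in the full pool / SD among those selected."""
    return r * u / sqrt(1 - r**2 + (r * u)**2)

# Illustrative values only: an observed validity of .30 in an admitted
# class whose predictor spread is two-thirds of the applicant pool's
# (so u = 1.5) corrects to a noticeably larger estimate.
print(round(correct_restriction(0.30, 1.5), 2))  # -> about 0.43
```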
Family income. In 1980, Ralph Nader, the well-known consumer advocate, criticized the SAT, and more generally ETS, stating that the SAT was not a valid predictor of college success and that SAT scores reflected family income more than scholastic potential (Nairn & Associates, 1980). These arguments were in part fallacious and not supported by empirical findings (Kaplan, 1982).
Although there was a time when only the wealthy and well-to-do could gain entrance to major universities, most college admission boards would argue that prospective students are evaluated on the basis of merit rather than economic background. SAT scores and parental income do correlate in the .20s; the Nader report, however, incorrectly used grouped data and found an r of .96 between the two variables. Kaplan (1982) calculated the correlation between mean SAT scores and mean GPA using grouped data and found an r of .999! The point here is that computing a correlation on grouped data is misleading; what is needed is not more politically misleading propaganda, but rigorous empirical analysis.
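Why grouped data mislead is easy to demonstrate. In the sketch below, individual-level data are simulated with a modest correlation of about .25 (in line with the .20s figure above; everything else is invented), and then the same data are collapsed into group means: the correlation between the means is dramatically inflated, because averaging removes the within-group variability.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Simulated individual-level data with a true correlation of about .25.
income = rng.normal(0, 1, n)
sat = 0.25 * income + np.sqrt(1 - 0.25**2) * rng.normal(0, 1, n)
print(f"individual-level r: {np.corrcoef(income, sat)[0, 1]:.2f}")

# Now aggregate: sort into income bands and correlate the band MEANS.
bands = np.array_split(np.argsort(income), 12)
mean_income = [income[b].mean() for b in bands]
mean_sat = [sat[b].mean() for b in bands]
print(f"grouped-means r:    {np.corrcoef(mean_income, mean_sat)[0, 1]:.2f}")
```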
Fair or unfair? Testing for admissions to educational institutions is a topic that generates a fair amount of heated controversy. Despite the horror stories one frequently hears about the unfairness of such tests, and how Jane did not get into her favorite university because of low test scores, surveys of test takers in fact show that most believe tests such as the SAT and the GRE are fair, that their test scores did not influence where they applied, and that they believe that institutions pay more attention to grades and other academic aspects (Baird, 1987); that indeed seems to be the case.

THE GRADUATE RECORD EXAMINATION

Purpose of the GRE. The GRE General Test and Subject Tests are designed to assess academic knowledge and skills relevant to graduate study. The GRE is designed to offer a global measure of the verbal, quantitative, and analytical reasoning abilities acquired over a long period of time and not related to a specific field of study. GRE scores are to be used in conjunction with other information to determine admissibility to graduate study. GRE scores are suitable for selection of applicants for admission to graduate school, selection of graduate fellowship applicants for awards, selection of graduate teaching or research assistants, and for guidance and counseling for graduate study. However, in the GRE Guide, a yearly publication of the GRE Board, specific mention is made that multiple sources of information, in addition to GRE scores, should be used in making decisions about specific candidates.

The GRE has two primary limitations: (1) It does not measure all the qualities that are important in predicting success in graduate study; and (2) It is an inexact measure; that is, only score differences between candidates that exceed the standard error of measurement can serve as a reliable indicator of differences in knowledge or abilities.
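The second limitation can be quantified with the standard error of measurement. A small sketch, using the 500/100 score scale and a reliability of .90 (an assumed value in the range reported for the General Test later in this section):

```python
from math import sqrt

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement."""
    return sd * sqrt(1 - reliability)

# Illustrative values: scale SD = 100, reliability about .90.
s = sem(100, 0.90)
se_diff = s * sqrt(2)  # standard error of the difference of two scores
print(f"SEM = {s:.0f} points; SE of a difference = {se_diff:.0f} points")
# Two candidates' scores need to differ by more than this before the
# difference is a reliable indicator of different ability levels.
```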
Widespread use. How widely used is the GRE? Oltman and Hartnett (1984) reported that of the more than 7,000 programs in the United States that offered the master's degree, almost 47% required the GRE General Test, and an additional 18% recommended or required the test for specific programs. Of nearly 5,500 doctoral programs, some 63% required it, and an additional 24% recommended or required the test for specific programs. Wide variations in practice were found among different academic areas; for example, 82% of biological sciences programs required or recommended the GRE General Test vs. 52% in the fine and applied arts. A study of 1972 vs. 1981 program requirements showed almost no overall change in the requiring of the GRE. A survey of a smaller number of departments indicated that the primary use of GRE scores seemed to be to compensate for otherwise weak applicant credentials, and that, in making admission decisions, graduate departments weighted most heavily undergraduate grades, followed by letters of recommendation, and then by GRE scores.

The General Test. The General Test yields separate scores for verbal, quantitative, and analytical abilities. The verbal portion uses four types of questions: antonyms (identify words opposite in meaning), analogies, sentence completions, and reading-comprehension questions. These questions cover a variety of content areas, such as arts and humanities, physical and biological sciences, social studies, everyday life, and human relationships and feelings.

The quantitative portion uses three types of questions: discrete quantitative questions that test basic mathematical skills, data-interpretation items that use charts and graphs, and comparisons that require the evaluation of the relative size of two expressions or quantities.
The mathematics that is required does not extend beyond that usually covered in high school, and covers arithmetic, algebra, geometry, and data analysis.

The analytical portion contains questions related to analytical reasoning, logical reasoning, and analysis of explanations. These questions attempt to assess such abilities as evaluating arguments, recognizing assumptions, and generating explanations.

Each form of the GRE consists of seven sections of 30 minutes' duration: two verbal, two quantitative, two analytical, and one for research purposes, such as trying out items that might be included in future forms.

The three portions of the General Test are correlated with each other. The average correlation between verbal and quantitative is .45, between verbal and analytical is .65, and between quantitative and analytical is .66. By 1999, the GRE General Test contained a new writing test and a new mathematical reasoning test. The five tests were packaged in two different combinations of four tests each.
Subject Tests. Currently there are Subject Tests in several areas, ranging from biochemistry to sociology. Each subject test yields a total score, and seven subject tests yield subscores. For example, the biology test yields three subscores: (1) cellular and molecular biology; (2) organismal biology; and (3) ecology and evolution. Each subject test differs in the number of questions. For example, the computer science test contains about 80 questions, while the psychology test contains about 220 questions (these numbers can change from form to form).

Scores on the GRE. Raw scores on the GRE are changed to both standard scores (mean of 500 and SD of 100) and percentiles. In the feedback given to candidates, these scores are based on two normative groups. The first group is all examinees who took the test during the past 3 years; the second is a subgroup of college seniors or recent college graduates who have not yet enrolled in graduate school. In addition, percentile ranks on the General Test are available for specific fields such as biology and psychology. For the General Test, the raw score is the number of questions answered correctly. For all the subject tests (except music), the raw score is the number of questions answered correctly minus one fourth of the number of questions answered incorrectly. On the General Test, scores can range from 200 to 800, with a theoretical mean of 500 and SD of 100. The actual range of scaled scores for the subject tests varies from test to test. Theoretically, the scores should range from 200 to 800, but in fact they range from 200 to 990. On the Biochemistry Subject Test, the 99th percentile is equivalent to a score of 760, while the first percentile is equivalent to a score of 300. By contrast, on the Physics Subject Test, the 97th percentile is 990 (the highest score), and the first percentile is equivalent to 400.

Although all the Subject Tests use the same scaling procedure, quite clearly scores on one Subject Test cannot be directly compared with scores on another Subject Test. Not only do the tests measure different content, but the score distributions are different, and the tests are taken by different examinees.
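The raw-score rule for the Subject Tests is simple to express in code. The conversion to the 200-800 scale is done by ETS through equating, which is more involved than any fixed formula; the linear transformation below, with a hypothetical raw-score mean and SD, is only meant to make the mean-500, SD-100 idea concrete.

```python
def raw_subject_score(right: int, wrong: int) -> float:
    """Formula score used on the Subject Tests (except music):
    rights minus one fourth of wrongs."""
    return right - wrong / 4.0

# A plain linear rescaling to mean 500, SD 100, using a HYPOTHETICAL
# raw-score mean of 90 and SD of 35; actual GRE scaling uses equating.
def to_scaled(raw: float, raw_mean: float = 90.0, raw_sd: float = 35.0) -> float:
    return 500 + 100 * (raw - raw_mean) / raw_sd

raw = raw_subject_score(right=120, wrong=40)  # 120 - 10 = 110
print(raw, round(to_scaled(raw)))             # 110.0 557
```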
Practice. Descriptive booklets that contain sample and practice questions are available free from ETS for both the General Test and for the Subject Tests. In addition, older test forms for the Subject Tests are available for purchase. Finally, a number of publications are available designed to assist students in preparing for these exams, thereby reducing any differences in test sophistication among candidates.

Development. Questions for the General Test are written primarily by ETS staff members, with degrees and background appropriate to the subarea they are working on. There is also a technical advisory committee composed of university professors specializing in fields such as mathematics, linguistics, and psychological measurement; this committee advises the staff on various aspects, such as content specifications.

Each item is reviewed by specialists both on the staff of ETS and outside ETS, and intensive discussions are held. Once the items are judged as appropriate, they are assembled into clusters and are included in an actual test administration. These questions do not contribute to the examinee's scores, but the data are used to statistically analyze the items. Those items that perform satisfactorily become part of a pool of items from which new forms of the General Test are assembled.
The same basic procedure is also used for the Subject Tests, except that the items are written primarily by experts in that field. For both the General Test and the Subject Tests there is an extensive and careful procedure that has evolved over the years, in which each item is scrutinized multiple times, both by itself and in the context of other items. Part of the review and of the subsequent statistical analyses is to ensure that items are not biased, sexist, or racist, and do not show an unfair relationship to minority-group membership.

Multiple scores. Candidates can take the GRE more than once and therefore may present multiple scores in their application. Studies have shown that individuals who repeat the General Test show on average a score gain of about 25 to 30 points – but these individuals are a self-selected group who believe that repeating the test will increase their scores. ETS suggests that multiple scores can be averaged, or that only the most recent or highest score be used.

Computer version. In 1992, the GRE program began administering a computerized version of the General Test. The computer version contains the same sections and methodology as the standard version, but has different time limits, and a minimal number of questions must be answered for a score to be generated. In 1993, a computer-adaptive form of the General Test was introduced. In an adaptive test, the selection of questions is tailored to an examinee's ability level. Initially, the examinee is presented with questions of average difficulty; subsequent questions are then a function of the examinee's pattern of responding. Correct answers lead to more difficult questions; incorrect answers lead to easier questions. Computer-delivered versions of the General Test and of many Subject Tests are now offered at many test centers, and their number and availability will increase substantially; eventually all GRE tests will be delivered by computer, and the candidate will receive the test scores at the close of the testing session.
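The adaptive principle just described can be illustrated with a deliberately simplified loop. Real computer-adaptive tests select items using item response theory and much richer scoring, so this is a sketch of the idea, not of the GRE's actual algorithm.

```python
def adaptive_session(answers_correctly, n_items: int = 10) -> float:
    """Toy adaptive loop: start at middling difficulty, step up after a
    correct answer, step down after an incorrect one. The final difficulty
    serves as a crude estimate of the examinee's ability level."""
    difficulty, step = 0.0, 1.0
    for _ in range(n_items):
        if answers_correctly(difficulty):
            difficulty += step   # harder question next
        else:
            difficulty -= step   # easier question next
        step = max(step * 0.7, 0.1)  # take smaller steps as we home in
    return difficulty

# Example: an examinee who can handle anything easier than difficulty 1.3;
# the loop converges toward that level.
print(round(adaptive_session(lambda d: d < 1.3), 2))
```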
tion. However, the best prediction is obtained
Reliability. Reliability, as measured by the K-R, when both undergraduate GPA and GRE scores
is in the low .90s for both the verbal portion and are placed in a composite (essentially a regres-
the quantitative portion, and in the high .80s for sion equation). A similar pattern shows up when
Table 13–6. Correlations with Graduate First-Year GPA

Variable                                           r
Undergraduate GPA                                 .37
GRE Verbal                                        .30
GRE Quantitative                                  .29
GRE Analytical                                    .28
Composite of Verbal, Quantitative & Analytical    .34
Composite of above + Undergraduate GPA            .46

Note that undergraduate GPA is a better predictor of graduate GPA than are the GRE subtests, taken either individually or in combination. However, the best prediction is obtained when both undergraduate GPA and GRE scores are placed in a composite (essentially a regression equation). A similar pattern shows up when individual subject tests are considered. Table 13.7 gives some examples; the four examples given below are drawn from more extensive data available in the GRE Guide.

Table 13–7. Correlations with Graduate First-Year GPA

                Subject     Undergraduate     GRE
Subject test    test r      GPA               Verbal
Biology           .37         .33               .24
Chemistry         .51         .36               .27
Economics         .43         .31               .22
Psychology        .37         .37               .29

Note that in each case, the correlation between scores on the Subject Test and graduate GPA is higher than the corresponding correlation between undergraduate GPA and graduate GPA, or between GRE Verbal and graduate GPA. As we discussed with the SAT, the evidence suggests that achievement tests are indeed better predictors.

Validity issues. There are two major issues related to the validity of the GRE. One is the criterion problem – that is, how do you operationally define graduate-school success? (see Hartnett & Willingham, 1980). A second problem is that of range restriction or, to put it another way, a very low selection ratio. In graduate psychology programs, for example, the mean selection ratio is .11 (only 11% of the applicants are actually accepted), although in fact for most programs the selection ratio is under .09 (Chernyshenko & Ones, 1999). Thus GRE validation studies typically involve only students who were accepted into graduate school – a highly restricted sample as far as GRE scores are concerned. When a validity coefficient is computed between GRE scores and some criterion of graduate performance such as GPA, that coefficient is really an underestimate of the correlation for the entire group of applicants. There are, however, statistical formulas to estimate the correlation for the entire sample. When this is done, the obtained correlation coefficients are quite respectable – in the .35 to .70 range.

Restriction of range. Various issues are involved in why the validity coefficients for the GRE are so low, and why there is substantial variation from study to study. One issue is that of restriction of range. Cohn (1985) indicated that the GRE was “the best documented instrument of its type,” and that restriction of range was a major consideration in the small validity coefficients obtained. Dollinger (1989) analyzed the GRE scores of 105 clinical psychology students admitted without regard for their GRE scores. Restriction of range of GRE scores was not a problem in this sample; GRE-V scores ranged from 340 to 800 and GRE-Q from 260 to 770. Dollinger (1989) used two criteria: (1) the number of failed preliminary examinations, and (2) a composite that incorporated the failures plus several criteria of timely progression in the program and faculty judgment. All three GRE scores (V, Q, and Advanced) correlated significantly with both criteria, with correlation coefficients ranging from .33 to .46, and better than graduate GPA. However, when the data were analyzed for minority students, the coefficients dropped substantially, and the only significant result was that GRE Advanced Test scores did correlate significantly with the criteria for both majority and minority students.

Huitema and Stein (1993) report an interesting study of 204 applicants to the Department of Psychology at Western Michigan University, where GRE scores were required but ignored in the admissions process. The authors show that the variation in GRE scores for the 138 applicants who were accepted was essentially the same as for the total pool of applicants; that is, there was no restriction of range. Under these circumstances, GRE total scores correlated between .55 and .70 with four criteria of graduate achievement, such as exam scores in advanced statistics courses and faculty ratings. Correlations between undergraduate GPA and the four criteria were all nonsignificant. The authors argue that restriction of range in fact severely limits the validity of the GRE. For example, the GRE Total correlated .63 with faculty ratings for the total sample, but only .24 for those whose GRE Total was at least 1200, a typical cutoff score.
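The effect of a low selection ratio is easy to reproduce by simulation. Below, an applicant pool is generated with a true predictor-criterion correlation of .55 (echoing Boldt's corrected estimate; the rest of the setup is invented), and only the top 11% on the predictor, the mean selection ratio quoted above, are admitted; the correlation among the admitted shrinks sharply.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Applicant pool with a true predictor-criterion correlation of .55.
gre = rng.normal(0, 1, n)
perf = 0.55 * gre + np.sqrt(1 - 0.55**2) * rng.normal(0, 1, n)

# Admit only the top 11% on the predictor.
cut = np.quantile(gre, 0.89)
admitted = gre >= cut

print(f"r in applicant pool: {np.corrcoef(gre, perf)[0, 1]:.2f}")
print(f"r among admitted:    {np.corrcoef(gre[admitted], perf[admitted])[0, 1]:.2f}")
```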
Restriction of range can also refer to the criterion, which is often GPA. Graduate GPA is usually expressed on a 4-point scale; in many graduate courses only As and Bs are awarded, and in some graduate programs “remedial” retesting is allowed if the initial examination score is below the A level. In fact, in the Huitema and Stein (1993) study, GPA was not even considered, because it was felt that assigned grades did not reflect the variation in the academic skills of the students.

Validity in psychology. There have been many studies of the predictive validity of the GRE with regard to graduate students in psychology. Typically the predicted criterion consists of grades, either overall or in specific courses, or examination performance (e.g., Boudreau, Killip, MacInnis, et al., 1983; Federici & Schuerger, 1974; House, J. J. Johnson, & Tolone, 1987).

Marston (1971) briefly reviewed the validity of the GRE. He reported that GRE-V and/or GRE-Q scores correlated with graduate-school grades in psychology from .23 to .64. Correlations with faculty ratings ranged from .29 to .57. In one study covering seven different psychology departments, correlations with the criterion of success versus failure in graduate school ranged from −.39 to .55, with a median r of about .18, and with the high correlation coefficient reflecting the contribution of undergraduate GPA as well.

Several studies have looked at degree completion as the criterion. For example, Merenda and Reilly (1971) found that GRE scores, in combination with undergraduate GPA, psychology-course GPA, and ratings of the quality of students' undergraduate institution, were able to successfully discriminate among students who earned their doctoral degrees without delay, those who earned the degree with some delays, and those who failed to complete their degrees. However, another study found that GRE scores alone were unable to differentiate between students who completed advanced degrees and those who did not (J. R. Rawls, D. J. Rawls, & Harrison, 1969).

Marston (1971) decided to examine post-PhD success by analyzing the GRE scores of 11 students and identifying their number of subsequent professional publications, which in psychology is considered evidence of professional achievement. For the clinical psychology students, the correlation between combined GRE scores and number of postdoctoral publications was −.05, and for nonclinical PhDs it was .18. Additional analyses generally supported the lack of relationship between GRE scores and publication rates. Marston (1971) concluded that it was time to have a nationwide review of the effectiveness of the GRE and to seek better alternatives.

House and J. J. Johnson (1993b) analyzed the predictive validity of the GRE by dividing graduate students into those enrolled in professional psychology areas, such as clinical and counseling, vs. those in experimental or general psychology. A regression analysis of GRE scores and undergraduate GPA to predict whether the student had completed the master's degree indicated that GRE Verbal scores were the best predictor for professional psychology students, but the worst predictor for experimental or general students. GRE Quantitative scores were, however, the best predictor for the experimental or general students (for a recent study, see Sternberg & W. M. Williams, 1997).

Advanced Psychology Test. Although the predictive validity of the GRE Verbal and Quantitative sections has been studied extensively, there are substantially fewer studies on the Advanced tests. In the area of psychology, as an example, Advanced Psychology Test scores have been significant predictors of grades and of performance on comprehensive examinations (e.g., Kirnan & Geisinger, 1981), but not of faculty ratings (e.g., Hackman, Wiggins, & Bass, 1970).

House and J. J. Johnson (1993b), in the study mentioned above, looked at GRE Advanced Psychology Test scores for 293 graduate students in master's programs. For the entire sample, test scores were significantly correlated with grades (r = .41), but the correlation coefficients showed substantial variation across program areas, from a low of .10 for clinical psychology students to .56 for counseling psychology students (see Kalat & Matlin, 2000, for an overview of this test).
Alternatives to the GRE. Are there other tests that could be used in lieu of the GRE? Potentially there are, but the only one of note that is used is the Miller Analogies Test.

Kirnan and Geisinger (1981) studied 114 graduate students at a private university enrolled in either clinical or experimental psychology. All students had taken the GRE as well as the Miller prior to admission, and as part of their studies had taken a Master's Comprehensive Exam. How did these variables correlate with the scores on the master's exam? Table 13.8 provides the answer.

Table 13–8. Correlations with Scores on a Master's Comprehensive Exam (Kirnan & Geisinger, 1981)

                                Clinical     Experimental
Variable                        students     students
GRE Verbal                        .44*          .32*
GRE Quantitative                  .31*          .07
GRE Advanced Test – Psychology    .35*          .03
Miller Analogies Test             .42*          .13
Undergraduate GPA                 .07           .05

* Statistically significant coefficients

Note that the GRE Verbal does a commendable job of predicting comprehensive-exam scores. The Miller also does well for the clinical students but not for the experimental students. Undergraduate GPA does not correlate significantly with exam scores. Whether these findings generalize to other institutions needs to be investigated, but for now we must conclude that the position of the GRE is not threatened by other tests.

GPA as the criterion. A basic question is whether first-year graduate GPA is the criterion measure that ought to be used. Perhaps whether a person obtains the degree or not is a more appropriate criterion of success. E. L. Goldberg and Alliger (1992) undertook a meta-analysis of the literature, identifying 27 studies dealing with counseling and/or psychology departments. Their analysis indicated that the GRE Advanced Test in Psychology did correlate with graduate-school success measured by multiple criteria, but the typical validity coefficient was about .19; neither the GRE-V nor the GRE-Q did as well. When graduate GPA was the criterion, the GRE did not demonstrate adequate predictive validity. The GRE-Q did predict grades in quantitative courses, and the GRE-V did predict comprehensive-exam performance. The authors suggested that what is needed is not necessarily to throw out the GRE, but to focus on the criterion. That is, we should define whether we are trying to predict graduation, scientific productivity, or something else, and operationalize such criteria.

What criterion to use? How can successful performance be defined? Hartnett and Willingham (1980) categorized three broad classes of criterion measures: (1) traditional criteria, such as grades; (2) evidence of professional accomplishment, such as publications; and (3) specially developed criteria, such as faculty ratings.

Traditional criteria include variables such as grades, degree attainment, time to complete the degree, performance on comprehensive examinations, and quality of dissertation. Grades have been used more than any other criteria in studies of graduate-school success and the validity of the GRE. Grades are readily available and are common to most institutions. Although grades reflect a variety of aspects, it is reasonable to treat them as reflective of an underlying dimension of “academic success.” On the negative side, the range of grades, particularly in graduate studies, is quite restricted, and grading standards vary substantially from setting to setting.

Whether a student obtains a degree is a most important outcome of graduate studies, and many regard this as the best criterion. Clearly, however, students drop out of graduate school for many reasons that have nothing to do with competence or academic skills. Time to degree is another criterion used as a measure of success in graduate school. Here, too, one can argue that speed of completion reflects a wide variety of circumstances that are unrelated to academic competence (House & J. J. Johnson, 1993a).

Part of graduate studies involves qualifying and/or comprehensive examinations, written and/or oral. The nature and form of these exams vary substantially across departments, with many departments never having defined precisely the nature and purpose of such exams. The scoring of these exams is frequently a highly subjective matter, and the reliability associated with such a criterion can easily be questioned. Dissertation quality presents other problems.
The dissertation is basically evidence that the student is able to conduct scholarly research in a sound and competent manner. On the one hand, this represents a potentially useful criterion; on the other, there are a number of problems, such as separating what portions reflect the student's work vs. the mentor's work.

In terms of evidence of professional accomplishment, there are a number of criteria that could be used, such as papers published or presentations at professional conferences. Such criteria have a number of problems. First, they may not be routinely collected and thus may not be available for analysis. Second, such accomplishments may mirror a variety of factors other than professional competence. Finally, such criteria are typically not normally distributed but are highly positively skewed.

Among the specially constructed criteria might be considered global faculty ratings and performance work samples. Ratings are relatively easy to obtain and provide a fairly convenient criterion. At the same time, ratings are limited, may show restriction of range, and are open to bias such as the “halo” effect, where ratings are influenced by the observer's general impression of the person being rated. With regard to work samples, graduate students are being trained both for specific tasks germane to their discipline, such as analysis of water contamination, and for more generic tasks such as research, scholarly work, and teaching. Theoretically, at least, one could develop work samples that could be used as criterion measures. In reality, such work samples are quite rare and present a number of both practical and theoretical difficulties (Hartnett & Willingham, 1980).

GRE with Hispanics. Whitworth and Barrientos (1990) compared Anglo and Hispanic graduate students on their respective performances on the GRE-V, GRE-Q, and GRE Analytical test scores, and on their undergraduate and graduate grades, to see how these variables predicted graduate academic performance. The sample, consisting of 320 Hispanics and 632 Anglos, comprised students admitted to graduate studies at the University of Texas at El Paso during a 5-year period. A statistical analysis indicated that Anglos scored higher than Hispanics on all three GRE variables and on both undergraduate and graduate GPA. The differences in GPA were somewhat small, but the differences on the GRE scores were more substantial. Regression equations were then computed with graduate GPA as the criterion. For Hispanics, the regression equation correlated with graduate GPA only .19, and the only variable that had some predictive power was undergraduate GPA. Essentially the same results were obtained with Anglo students, with the regression equation correlating .27 with graduate GPA; again, undergraduate GPA was the only significant variable. The authors concluded: (1) there is both an intercept bias (Anglos score higher on the GRE) and a slope bias (although the regression equations were poor predictors for both groups, they were slightly worse for the Hispanic group) on the GRE; and (2) continued use of the GRE for graduate-school selection is a questionable practice.

ENTRANCE INTO PROFESSIONAL TRAINING

The Medical College Admission Test (MCAT)

Purpose. The purpose of the Medical College Admission Test (MCAT) is to “measure achievement levels and the expected prerequisites that are generally relevant to the practice of medicine” (Association of American Medical Colleges, 1977).

The new MCAT. In 1977, a revised version of the MCAT consisting of six subtests (Biology, Chemistry, Physics, Science Problems, Skills Analysis: Reading, and Skills Analysis: Quantitative) replaced the original MCAT, which contained only four subtests (Science, General Information; Verbal Ability; and Quantitative Ability).

Validity. There is a substantial body of literature on the MCAT, with a great emphasis on its criterion validity as a predictor of first-year medical-school grades, and to a lesser extent as a predictor of scores on the National Board of Medical Examiners examinations, particularly part I (NBME-I), which examines knowledge of the basic sciences, and less frequently part II (NBME-II), which examines knowledge of the “clinical” sciences (R. F. Jones & Adams, 1982).
R. F. Jones and Thomae-Forgues (1984) indicate that there are five sets of questions regarding the validity of the MCAT:

1. How do MCAT scores compare in predictive validity with undergraduate GPA?
2. Do MCAT scores contribute unique information not already provided by undergraduate GPA?
3. What is the relative predictive validity of the individual MCAT scores in relation to overall performance in the basic medical sciences?
4. What is the relative predictive validity of the individual MCAT scores in relation to performance in specific areas of the medical-school curriculum?
5. How well does the MCAT predict medical-school competence?

To answer these questions, R. F. Jones and Thomae-Forgues (1984) analyzed data from some 20 medical schools and concluded the following:

1. When the criteria were medical-school course grades, MCAT combined scores were similar to undergraduate GPA in their predictive value. However, no single MCAT score tended to be correlated with medical-school grades as highly as undergraduate science GPA. When the criteria were NBME-I examination scores, MCAT scores in combination were substantially better predictors of performance than undergraduate grades.
2. How much predictive validity do the MCAT scores contribute? The increase in the average multiple correlation when MCAT scores were added to the GPA was .11 to .14 when medical-school course grades were the criterion, and .29 when NBME-I examination scores were the criterion. The authors indicated that the MCAT scores improved predictability by as much as 90% with course grades, and by nearly 300% with NBME examination scores.
3. Of the various subtests, Chemistry had the highest average correlation with medical-school grades, with Biology and Science Problems slightly less. In more than two thirds of the samples studied, either the Chemistry or the Biology subtest was the best predictor of medical-school grades and NBME-I exam scores.
4. The pattern of correlations between MCAT subtest scores and performance in specific areas of the medical-school curriculum tended to be consistent with content similarities – for example, the Chemistry MCAT subtest correlated .41 with grades in course work in Biochemistry (the Biology subtest correlated .31), while scores on the Biology MCAT subtest correlated .29 with grades in Microbiology (the Physics subtest correlated .08).
5. How well does the MCAT do? The authors point out that despite its simplicity, this is a complex question, and one needs to take into account: (1) the restricted range of the examinees – students who have been accepted into medical school, rather than applicants who have taken the MCAT; (2) the restricted range of medical-school GPA; and (3) the restricted reliability of some of the classroom exams on which medical-school GPA is calculated. Given these restrictions, the authors conclude that MCAT scores show “fairly strong” predictive validity with first-year medical-school grades, and “extremely strong” predictive validity with NBME-I examination scores. In general, these results are quite similar to those obtained by other graduate- and professional-school admission test programs.

Overall, one can conclude that the MCAT has significant predictive validity for medical-school grades in the first 2 years, and for scores on the NBME-I. In particular, the Biology and Chemistry subtests seem to be the most valid across medical schools. MCAT scores also add unique predictive information to other variables such as undergraduate GPA.

Admissions vs. advising. Most studies of the validity of the MCAT are carried out to assess how well the MCAT predicts medical-school performance, typically defined in terms of GPA and/or scores on the NBME. These findings are of use to the admissions committees, who may or may not place substantial weight on the MCAT scores of a potential applicant.

Donnelly et al. (1986) went one step further. They first studied a variety of variables, such as gender, undergraduate GPA, and other demographic aspects, to determine which would correlate significantly with scores on the NBME-I at their institution. They found that the best prediction was achieved with a regression equation composed of the MCAT (average of four subtests) and grades in anatomy courses (four such courses).
The equation was cross-validated, and then the results were sent to the currently enrolled medical students who had not yet taken the NBME-I. This was meant as a counseling device, so that students for whom failure was predicted could take remedial steps. Incidentally, the regression equation correlated .85 with NBME-I scores, and .90 when it was cross-validated (this is somewhat unusual, as correlation coefficients more typically drop in value when cross-validated). How well did the regression equation predict actual NBME-I scores? Table 13.9 gives the results.

Table 13–9. Actual vs. Predicted Performance on NBME-I (Donnelly et al., 1986)

                           Predicted performance
Actual performance         High pass     Average pass     Low pass     Fail
High pass (n = 76)         26 (92.9%)     50 (24.2%)      0 (0.0%)     0 (0.0%)
Average pass (n = 179)      2 (7.1%)     140 (67.6%)     27 (42.9%)   10 (14.7%)
Low pass (n = 65)           0 (0.0%)      16 (7.7%)      30 (47.6%)   19 (27.9%)
Fail (n = 46)               0 (0.0%)       1 (0.5%)       6 (9.5%)    39 (57.4%)
Totals:                    28            207             63           68

The percentages indicate the conditional probabilities. For example, 57.4% of those for whom the regression equation predicted failure in fact did fail, while 27.9% of those for whom the equation predicted failure did obtain a low pass. Notice that overall the results indicate a high degree of accuracy (92.6%), but that the accuracy is greater in predicting those who do pass rather than those who do fail.
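The conditional percentages in Table 13.9 can be recomputed directly from the cell counts, which is a useful check on how such a table is read (each column sums to the number of students given that prediction):

```python
import numpy as np

# Cell counts from Table 13.9 (rows = actual outcome; columns = predicted:
# high pass, average pass, low pass, fail).
counts = np.array([
    [26,  50,  0,  0],   # actual high pass
    [ 2, 140, 27, 10],   # actual average pass
    [ 0,  16, 30, 19],   # actual low pass
    [ 0,   1,  6, 39],   # actual fail
])

# Column-conditional probabilities: of those PREDICTED to land in a
# category, what fraction actually landed in each category?
col_pct = 100 * counts / counts.sum(axis=0)
print(np.round(col_pct, 1))
# e.g., of the 68 students predicted to fail, 39 did fail: 39/68 = 57.4%
```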
Generalizability of regression equations. To have any practical value, a specific regression equation should be applicable to more than just one group or class. If a regression equation is used for admission decisions, the results obtained with one class must be generalizable to the next class.

In the area of employment testing, considerable variability in the validity coefficients is observed from one study to another, even when the tests and the criteria used seem to be essentially identical. A series of studies showed that such variability was due to various statistical and research artifacts, such as sampling error, unreliability of the criteria used, and restriction of range; the conclusion is that validity results are quite generalizable (e.g., Pearlman, Schmidt, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt, Hunter, Pearlman, et al., 1979).

An analysis of 726 validity studies of the Law School Admission Test (LSAT) as a predictor of first-year grades in law school indicated that the “average true validity” was estimated to be .54, although the values varied substantially across different law schools and as a function of when the study was conducted (R. L. Linn, Harnisch, & Dunbar, 1981). Although a similar analysis is not as yet available for the MCAT, it is most likely that the results would be about the same.

Method effect. Nowacek, Pullen, Short, et al. (1987) studied all students who entered the University of Virginia School of Medicine in the years 1978 through 1984 (n = 974). They determined that MCAT scores predicted NBME-I scores well (rs in the .40 to .50 range) and predicted medical-school course grades less well (rs in the .30 to .40 range). Undergraduate science GPA predicted medical-school course grades well (rs in the .40 range), but predicted NBME-I scores less well (rs in the .30 range). These authors suggested that there might be a method effect present; both the MCAT and the NBME-I are long, multiple-choice, paper-and-pencil, standardized tests. Both undergraduate and graduate GPA include laboratory work, test data of various types, subjective ratings, group projects, and other less well-defined activities. The correlational pattern may well reflect the underlying nature of these variables.

Interstudy variability. There is a great deal of variability in the predictive results reported from study to study. For example, with first-year medical-school grades as the criterion, in one study the Physics subtest correlated .12 (C. M. Brooks, Jackson, Hoffman, et al., 1981), in another .02 (M. E. Goldman & Berry, 1981), and in yet another .47 (McGuire, 1980).
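How much interstudy spread sampling error alone can generate is easy to see by simulation: draw many small "validity studies" from a population in which the true correlation is fixed. The sample size and population correlation below are illustrative choices, not values from the MCAT literature.

```python
import numpy as np

rng = np.random.default_rng(4)
true_r, n_per_study, n_studies = 0.30, 50, 1000

# Draw many small-sample "validity studies" from a population where the
# true correlation is .30, and look at how widely the observed r's spread.
rs = []
for _ in range(n_studies):
    x = rng.normal(0, 1, n_per_study)
    y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(0, 1, n_per_study)
    rs.append(np.corrcoef(x, y)[0, 1])

rs = np.array(rs)
print(f"observed r: 5th-95th percentile = "
      f"{np.quantile(rs, 0.05):.2f} to {np.quantile(rs, 0.95):.2f}")
```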
Performance in the clinical years. Most studies of the validity of the MCAT focus on the prediction of grades in the first 2 years of medical school, designated as the basic science years, and on the parallel part I of the NBME.

Carline, Cullen, Scott, et al. (1983) focused on the last 2 years of medical school, designated as the clinical years, and on the parallel part II of the NBME. The validity coefficients of the MCAT with the NBME-II ranged from .03 to .47, with most coefficients in the high .20s to mid .30s range. Although these are not particularly high, as expected given the homogeneity of third- and fourth-year medical students, they were generally higher than the correlations between undergraduate science GPA and performance on the NBME-II.

New MCAT vs. original MCAT. We have basically considered only studies with the new MCAT. The literature suggests that the new MCAT has better predictive validity than the original MCAT. McGuire (1980), for example, reports a multiple regression of MCAT scores and undergraduate science GPA as correlating .57 with class rank in medical school, slightly better than the .50 obtained with the original MCAT. In this study, MCAT subtests correlated from a low of .25 to a high of .47 with class rank (median r of about .43), whereas undergraduate science GPA correlated .41 with class rank in medical school.

Differential validity. Does the validity of MCAT scores in predicting academic performance in medical school vary for students from different undergraduate institutions? Zeleznik, Hojat, and Veloski (1987) studied students from 10 undergraduate universities who were all attending the same medical school. GPA for the first and second years of medical school, as well as scores on the NBME exams, were used as criteria. Obtained correlations ranged from a low of .03 to a high of .66, with significant differences between institutions. The MCAT was more or less valid depending on the students' undergraduate institution. Incidentally, for all 10 institutions, combined MCAT scores correlated .32 with first-year GPA, .27 with second-year GPA, .39 with NBME-I scores, and .37 with NBME-II scores.

MCAT with black students. D. G. Johnson, Lloyd, Jones, et al. (1986) studied medical students at Howard University College of Medicine, a predominantly black institution. The criterion of performance consisted of grades in all 4 years of medical school and scores on both parts I and II of the NBME exams. In general, the predictive validities of the MCAT scores and of undergraduate GPA were found to be similar to those of studies with white medical students, and the results supported the use of the MCAT as an admissions criterion.

Coaching. Although most of the studies on coaching have focused on the SAT, and to a lesser extent on the Law School Admission Test (LSAT), there is also concern about the MCAT. N. Cole (1982) listed six components of test-preparation programs (or coaching), and each of these components has rather different implications for test validity:

1. supplying the correct answers (as in cheating)
2. taking the test for practice
3. maximizing motivation
4. optimizing test anxiety
5. instruction in test-taking skills
6. instruction in test content

Each of these components has a different impact depending upon whether the targeted test is an aptitude test or an achievement test. Aptitude tests are designed to measure abilities that are developed over a long period of time, and therefore should be relatively resistant to short-term intervention. Achievement tests reflect the influence of instruction, and performance on them should in fact be altered by well-designed courses of instruction.

R. F. Jones (1986) studied national samples of students who had taken the MCAT and compared the test scores of those who had been coached with those who had not. Coached examinees did better on the Biology, Chemistry, Physics, and Science Problems subtests, and equally well on the Skills Analysis: Reading subtest. Mixed results were obtained on the Skills Analysis: Quantitative subtest. The effect was small, however, and attributable to the science-review component of these coaching programs.
dentally, for all 10 institutions, combined MCAT results were obtained on the Skills Analysis:
Quantitative subtest. The effect was small, however, and attributable to the science-review component of these coaching programs.

The Dental Admission Testing Program (DAT)

The Dental Admission Testing Program (DAT) is administered by the Council on Dental Education of the American Dental Association, and has been in place on a national basis since 1950. The DAT is designed to measure general academic ability, understanding of scientific information, and perceptual ability. The test results are one of the sources of information that dental schools use in their admission procedures. The DAT is intended for those who have completed a minimum of 2 years of collegiate basic-science study.

Description. The current DAT contains four sections:
1. A subtest of 100 multiple-choice questions covers the natural sciences – specifically, the equivalent of college courses in biology and in chemistry (both organic and inorganic). This section requires the simple recall of basic scientific information and is called Survey of Natural Science.
2. The Perceptual Ability subtest contains 90 multiple-choice items that require the subject to visually discriminate two- and three-dimensional objects.
3. The Reading Comprehension subtest consists of 50 multiple-choice items based on a reading passage, similar to reading material in dental school.
4. The Quantitative Reasoning subtest contains 50 multiple-choice questions that assess the person’s ability to reason with numbers and to deal with quantitative materials.

Scoring. The test scores for the DAT are reported on a standard scale that ranges from −1 to +9, with a mean of 4 and SD of 2. A total of 10 scores are reported, including three composite scores: total science (based on section 1), academic average (based on sections 1, 3, and 4), and perceptual ability (based on section 2).

Reliability. K-R reliability coefficients typically range in the .80s. However, the DAT may place too much emphasis on speed; test-retest or parallel-form reliabilities would be more appropriate, but these are not reported (DuBois, 1985).

Validity. Scores on the Total Science and the Academic Average typically correlate in the .30s both with first-year grades in dental school and with later performance on the National Boards for Dentistry exams. The DAT seems to be as good a predictor of first-year dental-school grades as is undergraduate GPA. DuBois (1985) points out that the reported validity coefficients may be attenuated (i.e., lowered) by the fact that they are computed on admitted students rather than the broader group of those who take the exam, that grades in dental school typically show a restricted range, and that data from various institutions that have somewhat different curricula are lumped together – in other words, the same concerns expressed with the MCATs.
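The attenuation due to range restriction can be illustrated with the standard correction formula found in most psychometrics texts (often attributed to Thorndike), which estimates what the correlation would be in the full applicant pool rather than in the selected group. The numbers below are hypothetical and are not values reported for the DAT.

```python
import math

def correct_for_range_restriction(r_restricted, sd_pool, sd_selected):
    """Estimate the validity coefficient in the full applicant pool
    from the coefficient observed in the range-restricted
    (admitted) group."""
    k = sd_pool / sd_selected
    return r_restricted * k / math.sqrt(1 + r_restricted ** 2 * (k ** 2 - 1))

# An observed validity of .30 among admitted students, where selection
# cut the test-score SD from 2.0 (all examinees) to 1.4 (admitted only):
print(round(correct_for_range_restriction(0.30, 2.0, 1.4), 2))  # 0.41
```

In this invented example, a modest observed coefficient of .30 corresponds to roughly .41 in the unrestricted pool, which is the sense in which the reported DAT coefficients may understate the test's usefulness.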
Criticism. The DAT may be assessing academic skills rather than the combination of psychomotor skills and problem solving that are required in the everyday practice of dentistry (Cherrick, 1985). Another criticism is that, because the specific form of the DAT changes regularly, there are no data published to show that the different forms of the DAT are equivalent (DuBois, 1985).

TESTS FOR LICENSURE AND CERTIFICATION

In the United States approximately 800 occupations are regulated by state governments, including occupations such as barber, physician, and psychologist. Other occupations, travel agent or auto mechanic, for instance, are regulated by various boards and agencies (Shimberg, 1981). For many of these occupations, licensure or certification involves a test or series of tests. Licensure is a process whereby the government gives permission to an individual to engage in a particular occupation. The granting of the license reflects minimal competency, and usually there are definitions of what a licensed practitioner may do. Furthermore, it is illegal for someone who is not licensed to engage in any of the defined practices.
Certification is the recognition that a person has met certain qualifications set by a credentialing agency and is therefore permitted to use a designated title. Individuals who are not certified are not prohibited from practicing their occupation. For some occupations, both licensing and certification may be pertinent. For example, in order to practice, a physician must be licensed by the state. In addition, she or he may wish to be certified by one of a number of medical specialty boards, such as pediatrics or psychiatry. Often, certification standards reflect higher degrees of competency than those of licensure.
A rather wide variety of tests are used for licensing and certification purposes, some national in scope and others developed locally; some are the product of national test organizations such as Educational Testing Service or the Psychological Corporation, while others reflect the work of local boards. The format of many of these tests consists of multiple-choice items because they are economical to score, especially with large numbers of candidates, and the results can be readily tabulated. Often these exams are accompanied by work samples, e.g., a flying test for airplane pilots.
The purpose of licensing exams, and to some degree certification tests, is to protect the public’s welfare and safety, rather than to predict job success. Therefore, these tests should assess basic skills and abilities to carry out professional or occupational tasks safely and competently. Typically, licensing exams deal with an applicant’s knowledge and skill at applying relevant principles, laws, rules, and regulations (Shimberg, 1981). Rather than assess a full range of difficulty, such tests use items that assess minimal competency and thus should be relatively easy for individuals in that occupation. Sometimes licensing and certification tests yield only a total score, so that a candidate who is weak in area A can compensate by doing quite well in area B. Other tests yield subtest scores and may specify the required passing score for each subtest.
A more complicated issue is the setting of the cutoff score. On some tests the cutoff score is a relative standard; for example, the top 80% will pass and the bottom 20% will fail, no matter what the score distribution is. On other tests, the cutoff score represents an absolute standard, and theoretically every examinee could pass or fail.

Validity. This is a challenging issue for licensing/certification tests. Certainly content validity is very important and, in a certain sense, relatively easy to establish. These tests are usually designed to assess knowledge and skills in a particular area and are usually put together by experts in that area, often on the basis of a job or performance analysis. Thus content validity is often built into the test.
Criterion validity is more difficult to assess. Consider psychologists, for example. Some work in a private psychotherapeutic setting and may see patients who are highly motivated to change their lifestyle. Other psychologists work in mental hospitals, where they may see patients with multiple disabilities and for whom the hospital contact may represent a mechanism for the maintenance of the status quo. Some teach and carry out research in university settings, while others may be consultants to business organizations. What criteria could apply to these diverse activities?
Similarly, construct validity is difficult to apply. Most licensing and certification tests are not concerned with global and personological concepts such as adjustment, competence, or even professionalism.

Cutoff scores. A cutoff score is the score used to separate those who pass a test from those who do not. In school courses, a cutoff score on exams is typically pegged at 70%. Cutoff scores are used widely in a variety of settings, from schools to personnel decisions. Sometimes cutoff scores are set on the basis of clearly defined criteria, and sometimes they are set quite arbitrarily (for a detailed discussion see Cascio, Alexander, & Barrett, 1988).
What is an appropriate cutoff score? Legal challenges have resulted in a series of cases involving cutoff scores on tests, where the conclusion was that a cutoff score should be consistent with the results of a job analysis, it should permit the selection of qualified candidates, and it should allow an organization to meet its affirmative action goals. Cutoff scores should be based upon aspects such as reliability, validity, and utility, and should relate to the proficiency of the current work force.
How are cutoff scores set? There are two basic ways, paralleling norm-referenced vs. criterion-referenced approaches. Thorndike (1949) suggested the “method of predictive yield,” which now would be called a human-resources planning approach (Cascio, Alexander, & Barrett, 1988). That is, information regarding projected personnel needs, the past history of the proportion of offers accepted, and a large sample distribution of applicants’ test scores are all studied to set a cutoff score on a test that will yield the number of applicants needed. For example, if I have 400 applicants for 20 positions, and in the past 60% of applicants who were offered employment accepted, then the cutoff score will be one that identifies the top 34 individuals.
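The arithmetic behind the predictive-yield example is simple enough to sketch; the function below is ours, using the numbers from the text.

```python
import math

def offers_needed(openings, past_acceptance_rate):
    """Number of top scorers who must receive offers so that, at the
    historical acceptance rate, all openings are expected to fill."""
    return math.ceil(openings / past_acceptance_rate)

# 20 positions, 60% of past offers accepted: extend offers to the
# top-scoring 34 of the 400 applicants, so the cutoff score is the
# 34th-highest score in the applicant distribution.
print(offers_needed(20, 0.60))  # 34
```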
Another way to determine a cutoff score is simply to base it on the distribution of applicants’ test scores – for example, at the mean, at the 80th percentile, at 1½ standard deviations above the mean, etc. Norm-referenced methods like these are relatively simple and minimize subjective judgment. They may be acceptable where there is a need to create a list of eligible applicants, but they would probably not be acceptable in situations where minimum competency needs to be identified.
There are basically two methods to set criterion-referenced cutoff scores. In one, experts provide the judgments about the test items; in the other, a judgment is made about the criterion performance of individuals. When experts are asked to provide the judgments, the procedure used typically includes one or more of the methods proposed by Angoff (1971), Ebel (1972), or Nedelsky (1954).
The method most often used is the Angoff method (Angoff, 1971), after its author. Angoff suggested that the minimum raw score for passing can be developed on a test by looking at each test item and deciding whether a “minimally acceptable person,” i.e., a barely qualified person, could answer each item correctly. A related variation of this procedure, also suggested by Angoff, is to state the probability, for each item, that the “minimally acceptable person” would answer the item correctly. The mean probability would then represent the minimally acceptable score. These judgments could be made by a number of judges and an average computed. The computed cutoff score, under this method, is rather stringent; for example, in a study of the National Teacher Examination, 57% of the examinees would have failed the exam (L. H. Cross, Impara, Frary, et al., 1984). G. M. Hurtz and N. M. R. Hertz (1999) recommend that 10 to 15 judges be used to establish cutoff scores.
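A minimal sketch of the Angoff computation, assuming each judge rates every item (the data below are invented for illustration):

```python
def angoff_cutoff(judge_item_probs):
    """judge_item_probs[j][i] = judge j's estimated probability that a
    'minimally acceptable person' answers item i correctly.  Each
    judge's probabilities sum to an expected raw score; the cutoff is
    the average of those expected scores across judges."""
    per_judge = [sum(probs) for probs in judge_item_probs]
    return sum(per_judge) / len(per_judge)

# Two judges rating a 5-item test:
ratings = [
    [0.9, 0.6, 0.7, 0.5, 0.8],  # judge 1: expected score 3.5
    [0.8, 0.5, 0.6, 0.6, 0.7],  # judge 2: expected score 3.2
]
print(round(angoff_cutoff(ratings), 2))  # 3.35 out of 5 items
```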
The Ebel procedure is similar, but judges are also asked to rate the relative importance of each item. The Nedelsky method requires the test judges to identify those distractors of a multiple-choice question that a “minimally competent” examinee would recognize as incorrect. The expected chance score over the remaining choices is then computed, and these scores are averaged across judges. Thus the cutoff score represents an above-chance score that takes into account the obvious distractors. Such a standard is quite lenient. All three methods have rather poor interjudge reliability and are time consuming (for reviews, see Berk, 1986; Shepard, 1980).
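The Nedelsky computation can be sketched the same way; here each entry records how many distractors of a four-choice item a judge believes the borderline examinee would eliminate (invented data):

```python
def nedelsky_cutoff(judge_eliminations, n_choices=4):
    """judge_eliminations[j][i] = distractors of item i that judge j
    says a 'minimally competent' examinee would rule out.  That
    examinee is assumed to guess at random among the remaining
    choices, so each item contributes 1/(choices left) to the
    expected score; expected scores are averaged across judges."""
    per_judge = [sum(1 / (n_choices - k) for k in judge)
                 for judge in judge_eliminations]
    return sum(per_judge) / len(per_judge)

# Two judges, three items:
print(round(nedelsky_cutoff([[2, 1, 3], [2, 2, 3]]), 2))  # 1.92 of 3 items
```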
In the second criterion-referenced method, contrasted-groups analysis is used. A group of clearly competent individuals is compared with a group of either marginally competent or not competent individuals. Once these two groups have been identified, the cutoff score is defined as the point of intersection of the two test-score distributions.
Cascio, Alexander, and Barrett (1988) suggest that it is unrealistic to expect to determine a single best method of setting cutoff scores, and that the process should begin with a careful job analysis. They also suggest that cutoff scores be set high enough to ensure the meeting of minimum standards of job performance and to be consistent with normal expectations of acceptable proficiency within the work force.

SUMMARY

Tests are used in the school context for a variety of purposes, many of which were discussed in Chapters 9 and 12. In this chapter, we looked at the California Achievement Tests as applicable to both elementary and secondary schools. At the high-school level we illustrated several content areas, including the assessment of social competence, the GED tests used to award high-school equivalency diplomas, and the NAEP used as a national thermometer of school achievement. For college, the focus was on the SAT, while for graduate school it was the GRE. Finally, we briefly covered some of the tests used for admission into professional schools and some of the issues concerned with licensure and certification.
SUGGESTED READINGS

Green, B. F., Jr. (1978). In defense of measurement. American Psychologist, 33, 664–670.
A well-written review of many of the criticisms of psychological tests, especially as they apply to educational settings.

Kaplan, R. M. (1982). Nader’s raid on the testing industry. American Psychologist, 37, 15–23.
A rebuttal of two arguments used by Nader and others as a criticism of the SAT, namely that the SAT is no better than chance in predicting college performance and that the use of SAT scores denies low-income students the opportunity to be admitted to college.

Menges, R. J. (1975). Assessing readiness for professional practice. Review of Educational Research, 45, 173–207.
A review of how readiness for professional practice can be measured in a variety of ways, especially with regard to the helping professions.

Powers, D. E. (1993). Coaching for the SAT: A summary of the summaries and an update. Educational Measurement: Issues and Practice, 12, 24–39.
A review of some of the key issues about coaching and of several meta-analyses that have been done.

Wainer, H. (1993). Measurement problems. Journal of Educational Measurement, 30, 1–21.
An excellent article that looks at 16 unsolved problems in educational measurement and what might be potential solutions.

DISCUSSION QUESTIONS

1. Why are tests less reliable when administered to young children?
2. How would you make sure that a teacher rating scale has adequate content validity?
3. What was your experience with the SAT or other college-admission procedures?
4. This chapter mentions outcome-oriented tests vs. process-oriented tests. Think back to the various tests you are now familiar with. How would you classify each of these? Could a test be both?
5. As this chapter indicates, there is a gender gap on the SAT. What might be some of the reasons? What evidence could be obtained to shed light on this situation?
14 Occupational Settings
AIM This chapter looks at some issues and examples involved in testing in occupational settings, including the military and the police. Many of the tests that are used are tests we have already seen – for example, tests of personality such as the CPI (Chapter 4), tests of intelligence such as the WAIS (Chapter 5), or tests to screen out psychological problems such as the MMPI (Chapter 7). Our emphasis here will be on issues and tests not discussed before.
SOME BASIC ISSUES

Purposes of Testing

In the world of work, testing can serve a number of purposes, including the following:
1. To determine potential for success in a program. For example, if a program to train assembly-line workers requires certain basic mathematical and reading skills, candidates who do not have such skills could be identified and remediation given to them.
2. To place individuals in programs. This involves matching the candidates’ abilities and competencies with the requirements of specific training programs.
3. To match applicants with specific job openings.
4. To counsel individuals, for career advancement or career changes, for example.
5. To provide information for program planning and evaluation.

Preemployment Testing

Preemployment testing generally serves two purposes: (1) to elicit a candidate’s desirable and undesirable traits, and (2) to identify those characteristics of the candidate that most closely match the requirements of the job (D. Arthur, 1994). Sometimes tests are used to screen out candidates; those who pass the individual testing are then given individual interviews. Sometimes tests are used after the interview, generally to confirm the interview findings.

Employment Testing

Employment testing is often used to evaluate the promotability of an employee. Sometimes tests are used to identify employees who have certain specific skills, or for career advising. Most tests seem to be administered to middle managers and supervisors, followed by clerical workers, executives, and professionals (D. Arthur, 1994).

Government Regulations

Formal governmental regulation of testing began in 1968, when the U.S. Secretary of Labor signed the first Testing and Selection Order, which indicated that government contracts were required to specify that the contractor could not discriminate against job applicants because of race, color, religion, gender, or national origin. This applied to selection procedures, including testing.
In 1978, a Uniform Guidelines on Employee Selection Procedures was adopted. These guidelines provide a framework for determining the proper use of tests when they are used for employment decisions (D. Arthur, 1994). The guidelines recognized the three major categories of validity, i.e., criterion, content, and construct.

Personnel Selection

In the area of personnel selection, there seem to be five major themes in the literature, all of them basically relevant to the issue of criterion validity (Schmitt & Robertson, 1990): (1) job analysis – much of this research has focused on the nature and quality of job-analysis ratings; (2) predictor development and measurement – the focus here has been on such predictors as assessment centers, interviews, biodata, and personality tests, as well as other selection procedures; (3) criterion development and measurement – much of the focus here has been on job-performance ratings and on issues such as what variables increase or decrease the accuracy and validity of ratings; (4) validity issues – such as the nature of validity and the use of meta-analysis; and (5) implications of implementing a selection strategy – for example, how adverse impact can be minimized.

What Methods Are Used?

A. M. Ryan and Sackett (1987a) surveyed about 1,000 industrial and organizational psychologists regarding individual assessment practices. One of the questions asked was what methods were used in assessing managerial potential. The results are presented next:

Method                              Used by
interview                           93.8%
personal history form               82.7%
ability tests                       78.4%
personality &/or interest tests     77.8%
simulation exercises                38.2%
projective tests                    34.0%

The most frequently used test was the Watson-Glaser Critical Thinking Appraisal. Among the most frequently used personality tests were the 16 PF, the CPI, and the MMPI (discussed in Chapter 4). Of the simulation exercises, the most frequent was the “in-basket” (to be discussed below). When asked to rank various procedures as to validity, simulations, ability tests, and personal history forms were ranked as the most valid, while personality tests and projective techniques were ranked as least valid.

Instruments Used in Industry

Typically, textbooks (such as Guion, 1965a) list five major types of tests that are used in occupational settings:
1. General measures of intellectual ability. These include individual tests such as the WAIS (discussed in Chapter 5), or group screening measures such as the Wonderlic Personnel Test (discussed below).
2. Measures of specific intellectual abilities. These might involve clerical aptitude tests, measures of spatial relations, and measures of creativity, abstract reasoning, and numerical reasoning. Many of these tests are packaged as batteries that assess multiple aptitudes.
3. Measures of sensory and psychomotor abilities, including vision testing, tests of coordination, and tests of manual dexterity. Many of these tests involve apparatus rather than paper-and-pencil. For example, in the O’Connor Tweezer Dexterity Test the subject uses tweezers to pick up pins and place them as rapidly as possible in a board that has 100 holes. In the Purdue Hand Precision Test, there is a revolving turntable with a small hole in it. As the turntable revolves, the subject inserts a stylus in an attempt to touch target holes beneath the turntable. The apparatus records the number of correct responses, the number of attempts, and the time elapsed.
4. Measures of “motivation,” often used as a catchall phrase to include interest inventories such as the SVIB (see Chapter 6), personality inventories such as the CPI (see Chapter 4), and projective techniques (see Chapter 15).
5. Specially derived measures such as biographical inventories (or biodata), standardized interviews, and work samples.

How Is Job Success Measured?

Tests are often validated against the criterion of job success, but there are many ways of defining such a global variable, all of which have limitations (Guion, 1965a):
1. Quantity and/or quality of production. In a factory situation, quantity might be measured by the actual number of units produced within a time period. Quality might be assessed by the number of units that do not pass inspection or meet specific engineering criteria. In the area of sales, quantity might be defined in terms of dollar amounts, and quality in terms of client contacts per sale. The possibilities are quite numerous and may differ drastically from each other.
2. Personnel records. These might provide operational criteria such as absenteeism, number of industrial accidents, etc.
3. Administrative actions, which may be included in personnel records, might cover such processes as promotions, pay increases, resignations, and so on.
4. Performance ratings, made on rating scales of various types.
5. Job samples, a standardized sample of work for which all persons perform the same tasks. The performance of the applicants can then be compared directly and ranked. Such job samples are often used as preemployment tests – for example, typing tests.

The Criterion Problem

Do different methods of measuring job performance – such as work samples, ratings by supervisors, self-ratings, measures of production output, etc. – result in different validity results for the same tests? Nathan and Alexander (1988) conducted meta-analyses of validity coefficients from tests of clerical abilities for five criteria: supervisor ratings, supervisor rankings, work samples, production quantity, and production quality. They found that for the first four criteria, high test validities were obtained, with validities resulting from rankings and from work samples on the average higher than the validities resulting from ratings and from quantity of production. Only the fifth criterion, quality of production, had low validity and did not generalize across situations.

Criteria Are Dynamic

Ghiselli and Haire (1960) suggested that the criteria against which tests are validated are dynamic, that is, they change over time. In a typical study, a test is administered to a sample of applicants or new employees, and the test is then validated against some measure of job performance during an initial period of employment, perhaps after 3 or 4 months. What is of interest, however, is performance over a much longer period of time. It is not unusual for performance to increase over time, but not all individuals improve at the same rate or in the same amount.
Ghiselli and Haire (1960) studied 56 men who had been hired as taxicab drivers, none of them having had prior experience in this type of work. At the time of hiring, these men were given a battery of tests covering arithmetic, speed of reaction, distance discrimination, and interest in occupations dealing with people. The criterion of job performance was dollar volume of fares. Such data were collected for each of the first 18 weeks of employment. The results showed that during the 18 weeks there were significant changes in average productivity (greater in the last 3 weeks than in the first 3 weeks), in the range of individual differences (larger standard deviation in the last 3 weeks), and in the order of individuals (r = .19 between productivity on the 1st week and productivity on the 18th week). The validity of the tests also changed substantially over the time period. For example, the inventory designed to measure interest in dealing with people correlated .42 against the criterion for the first 3 weeks, but dropped to .13 against the criterion of the last 3 weeks. In fact, the tests that correlated significantly with the criterion for the first 3 weeks were different from those that correlated significantly with the criterion of the last 3 weeks. An analysis of productivity over all 18 weeks, and of rate of improvement in production, showed that tests that predicted one of these criteria were poor at predicting the other.

The Method Problem

Quite often, different results are obtained in different studies because different methods were used. For example, in the area of market research, questions (i.e., test items) are often presented either verbally or pictorially. Weitz (1950) wondered whether the two methods would yield equal results. A sample of 200 adult women was surveyed. One half were asked verbal questions
about the design of a cooking range (e.g., Do you prefer a table-top oven or a high oven?), and the other half were shown sketches of the various options. All subjects were interviewed by the same interviewer, and the two subsamples were matched on socioeconomic background. For eight of the nine choice questions, there were significant differences between the two groups. For example, when asked whether they preferred the burner controls on the back or the front panel, 91 of the 100 verbal responses indicated a preference for the back panel, but only 73 of the 100 pictorial responses gave such a preference. Weitz (1950) concluded that these two questionnaire techniques are not interchangeable, and that data obtained from the two methods should not be equally evaluated.

SOME BASIC FINDINGS

Predicting Job Performance

Schmidt and Hunter (1981) argued that:
1. Professionally developed cognitive-ability tests are valid predictors of performance on the job and in training for all jobs in all settings.
2. Cognitive-ability tests are equally valid for minority and majority applicants.
3. Cognitive-ability tests are fair to minority applicants in that they do not underestimate the expected job performance of minority groups.
4. The use of cognitive-ability tests for selection in hiring can produce substantial savings for all types of employers.

Numerous studies have calculated that the use of valid tests in selecting individuals for specific jobs saves companies, the federal government, and the armed forces millions of dollars (e.g., J. E. Hunter & R. F. Hunter, 1984; Schmidt, Hunter, Pearlman, et al., 1979). J. E. Hunter and R. F. Hunter (1984) applied meta-analysis to literally thousands of studies. Here are some of their conclusions:
1. Cognitive-ability tests have a mean validity of about .55 in predicting training success, across all known job families.
2. There is no job for which cognitive ability does not predict training success.
3. The validity of psychomotor-abilities tests can vary on average from .09 to .40 across job families; thus, under some circumstances the validity of psychomotor tests may be very low.
4. Even the smallest mean validity of cognitive tests (such as .27 for sales clerks) is large enough to result in substantial labor savings, if such tests are used for selection.
5. As job complexity decreases, the validity of cognitive tests decreases, but the validity of psychomotor tests increases.
6. If general cognitive ability alone is used as a predictor, the average validity across all jobs is .54 for a training-success criterion and .45 for a job-proficiency criterion.
7. For entry-level jobs, predictors other than tests of cognitive ability have lower validities.
8. Validity could be increased by using, in addition to cognitive tests, other measures such as social skills and personality that are relevant to specific job performance.

A classic example is that of General Electric, which because of government pressure abandoned the use of job-aptitude tests in hiring. The company eventually realized that a large percentage of the people hired without the use of such tests were not promotable. Thus “adverse impact” had merely been shifted from the hiring stage to the promotion stage (Schmidt & Hunter, 1981).

Job Performance: Cognitive and Biodata

Overall, the two most valid predictors of job performance are cognitive-ability tests and biodata forms. J. E. Hunter and R. F. Hunter (1984) reviewed the available literature and estimated the average validity of general cognitive-ability tests against the criterion of supervisory ratings of overall job performance to be .47, and that of biodata forms to be .37. The literature also indicates that such findings are generalizable; thus the results of biodata forms are not limited to one specific situation.

g as a Predictor of Occupational Criteria

Ree and Earles (1991) studied some 78,000 airmen in 82 job specialties and found that g, or general cognitive ability, was the most valid predictor
of technical school grades. In Project A, another large military study (McHenry, Hough, Toquam, et al., 1990), g was the best predictor of Army performance measures. Ree, Earles, and Teachout (1994), in a study of Air Force enlistees who took the ASVAB, found that g was the best predictor of job performance, with an average r of .42. Olea and Ree (1994) studied Air Force navigator and pilot students. Again, g was the best predictor of criteria such as passing or failing the training, grades, and work samples.

Cognitive Ability and Minorities

Although cognitive-ability tests are valid indicators of on-the-job performance, they present a serious challenge, namely that in the United States blacks score, on the average, about one SD lower than whites (J. E. Hunter & R. F. Hunter, 1984). If such tests are used to select applicants, the result is what the courts called adverse impact. When this problem arose in the 1960s, the solution seemed relatively straightforward: if cognitive tests were unfair to black applicants, one only needed to make the tests fair, that is, to remove the items that were culturally biased.
The evidence collected since then has not supported this approach. As discussed in Chapter 11, any test that is valid for one racial group is valid for the other. Single-group validity, where a test is valid for one group but not another, and differential validity, where the test is less valid for one group than for another, are artifacts of small sample size. If tests were in fact culturally biased, that would mean that the test scores for blacks would be lower than their true ability scores, and so their job performance would be higher than what is predicted by their test scores. In fact, however, the regression equations for blacks are either equal to those of whites or overpredict the performance of blacks. The overwhelming evidence is that differences in average test scores reflect real differences in abilities. These differences are likely the result of societal forces such as poverty and prejudice. Eliminate these, and the test differences should be eliminated.

Minimizing Adverse Impact

McKinney (1987) suggests that to maximize both predictive efficiency and affirmative action, one should select the best-scoring individuals from each racial subgroup in proportions that equal their representation in the applicant pool. The U.S. Employment Service in fact does this with scores on the General Aptitude Test Battery (see discussion that follows).
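A sketch of what McKinney's suggestion amounts to in practice; this is our own minimal implementation, with invented group labels and data.

```python
from collections import defaultdict

def proportional_top_down(applicants, n_to_select):
    """applicants: list of (score, subgroup) pairs.  Select top scorers
    within each subgroup, allocating selections in proportion to the
    subgroup's share of the applicant pool."""
    pools = defaultdict(list)
    for score, group in applicants:
        pools[group].append(score)
    selected = {}
    for group, scores in pools.items():
        quota = round(n_to_select * len(scores) / len(applicants))
        selected[group] = sorted(scores, reverse=True)[:quota]
    return selected

# Hypothetical pool: 300 applicants in group A, 100 in group B.
pool = [(70 + i % 30, "A") for i in range(300)] + \
       [(65 + i % 30, "B") for i in range(100)]
picks = proportional_top_down(pool, 20)
print({g: len(s) for g, s in picks.items()})  # {'A': 15, 'B': 5}
```

Each subgroup is still ranked strictly by score; only the number selected from each subgroup is fixed by its share of the applicant pool.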
RATINGS

Supervisors’ Ratings

A supervisor’s rating is probably the most common measure of job performance, and the most common criterion against which tests are evaluated. These ratings are used not only to validate tests, but also for issues such as promotions and pay raises, as well as to assess the impact of training programs. Other, more “objective” criteria, such as salary or promotion history, can be used instead of ratings, but most of these criteria have some serious shortcomings. Rating scales can be constructed from a variety of points of view, and the literature has not clearly identified one form as superior to others (e.g., Atkin & Conlon, 1978; Dickinson & Zellinger, 1980).
Lawler (1967) suggested that the multitrait-multimethod approach (see Chapter 3) might be a useful one for performance appraisal, where the multimethod is replaced by multirater. Rather than use a single supervisor’s ratings, this approach calls for multiple ratings – i.e., ratings by supervisors, by peers, by subordinates, and by the individual – by anyone who is familiar with the aspects of the individual’s performance to be rated.
The multitrait requirement is met by rating somewhere around three to five traits. One rating should be a global one on quality of job performance. The nature of the other ratings depends upon the purpose of the rating procedure and the particular types of behaviors relevant to a specific job. Lawler (1967) suggests that whatever these behaviors are, the rating scales should be behavior-description anchored scales. Rather than rate dimensions such as friendliness and adaptability (which cannot be rated reliably), what should be rated is effort put forth on the job and ability to perform the job.
Lawler (1967) gives as an example some data based on a group of managers, where superiors’, peer, and self-ratings were obtained on three
dimensions: quality of job performance, ability to perform the job, and effort put forth on the job. The results are presented in Table 14.1.

Table 14–1. An Example of the Multitrait-Multimethod (Multirater) Approach (Lawler, 1967)

Ratings by:                          Superiors          Peers              Self
                                     1     2     3      4     5     6      7     8
Superiors
1. Quality of job performance
2. Ability to perform the job       .53
3. Effort put forth on the job      .56   .44
Peers
4. Quality of job performance       .65   .38   .40
5. Ability to perform the job       .42   .52   .30
6. Effort put forth on the job      .40   .31   .53    .55
                                                       .56   .40
Self
7. Quality of job performance       .01   .01   .09    .01   .17   .10
8. Ability to perform the job       .03   .13   .03    .04   .09   .02    .43
9. Effort put forth on the job      .06   .01   .30    .02   .01   .30    .40   .14

Note: The heterotrait-monorater coefficients are the within-rater triangles (e.g., .53, .56, .44 for superiors). The monotrait-heterorater coefficients – i.e., the validity coefficients – are the diagonal entries of the between-rater blocks (e.g., .65, .52, .53 for superiors vs. peers). The remaining values are the heterotrait-heterorater coefficients.

Note that the validity coefficients – that is, the correlations on the same trait by different raters – should be highest of all. In this case, the ratings by superiors and by peers do show such convergent validity, but the self-ratings do not. These validity coefficients are higher than the heterotrait-heterorater coefficients, even in the case of the self-ratings – as they should be to show discriminant validity. Finally, the validity coefficients should be higher than the heterotrait-monorater coefficients – which they are.
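To make the logic of these comparisons concrete, the sketch below stores the peer-by-superior block of Table 14.1 and contrasts its diagonal (the validity coefficients) with its off-diagonal heterotrait-heterorater values:

```python
# Peer (rows) by superior (columns) correlations from Table 14.1;
# traits in order: quality of performance, ability, effort.
peer_by_superior = [
    [0.65, 0.38, 0.40],
    [0.42, 0.52, 0.30],
    [0.40, 0.31, 0.53],
]

# Convergent (monotrait-heterorater) validities are the diagonal.
validities = [peer_by_superior[i][i] for i in range(3)]
hetero = [peer_by_superior[i][j]
          for i in range(3) for j in range(3) if i != j]

print(validities)   # [0.65, 0.52, 0.53]
print(max(hetero))  # 0.42 -- every validity exceeds these values
```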
Self- vs. Other Ratings

The use of multiple sources for performance ratings has gained considerable acceptance due to several advantages, among them greater reliability and a stronger legal standing. Often, however, there is a lack of agreement between self-ratings and those provided by peers and by supervisors. For example, Harris and Schaubroeck (1988) conducted a meta-analysis of the literature and computed the average correlation between self- and peer ratings to be .36, between self- and supervisor ratings to be .35, and between peer and supervisor ratings to be .62. Some view such lack of agreement as expected, because different raters observe different aspects of a person’s performance or have different views of what effective performance is.
Others view the lack of agreement as reflective of bias in the self-ratings. That is, when individuals are asked to rate their own performance, they may tend to inflate their ratings on such aspects as self-esteem.

Rating Errors

Traditionally, ratings are subject to two types of errors, identified as halo and bias. Halo refers to the tendency to rate an individual high or low on all dimensions because the person is outstandingly high or low on one or a few dimensions. Thus, if I like you because you are particularly friendly, I might be tempted to also automatically rate you as high on interpersonal skills, leadership ability, intelligence, etc. Halo has always been regarded as a rating error, something that could potentially be controlled through training of raters or through better rating forms. In fact, halo may not be an error but may result in increased validity coefficients – perhaps halo serves to ensure that raters consider the “person as a whole” rather than pay attention to specific but perhaps unrepresentative critical incidents (Nathan & Tippins, 1990). The literature also indicates that halo can be analyzed into two components: a true or valid component and an illusory or invalid component. That is, if the
ratings across dimensions correlate substantially, this is taken as evidence of the halo effect. Such high correlations reflect, in part, the real overlap among such dimensions (i.e., true halo) and irrelevant factors such as memory errors on the part of the rater (i.e., invalid halo). Because of such halo error, it is assumed that the obtained correlations among dimensions are larger than the “true” correlations. Murphy, Jako, and Anhalt (1993) reviewed the literature on the halo error and concluded that the halo effect is not all that common, that such an effect does not necessarily detract from the validity of ratings, and that it is probably impossible to separate true from invalid components. (For another review, see Balzer & Sulsky, 1992.)
Rater bias refers to the tendency of raters to pile up their ratings – one rater may be lenient and rate everyone as above average, another rater may be tough and rate everyone as below average, and a third rater may restrict the ratings to the middle (often called the leniency error or the error of central tendency). The broader term “rater bias” covers all these cases.
Some techniques have been developed to counter these errors. One is the forced-distribution method, where the rater is instructed to have the ratings follow a specified distribution, usually based on the normal curve. We could, for example, force a rater to make 5% of the ratings fall in the highest category and 5% in the lowest, 20% in the above-average category and 20% in the below-average category, and 50% in the average category.
Another approach is to rank the individuals being rated. This is a feasible procedure when the number of individuals is somewhat small. A variation of the ranking method is called alternation ranking, where the rater selects the best and the poorest person on the rating dimension, then the next best and next poorest, and so on, until all persons have been ranked. Because ranks do not form a normal distribution, but a rectangular one, they must be converted to normalized standard scores for statistical analyses. Tables for doing this are available (see Albright, Glennon, & Smith, 1963, pp. 172–173 for an example that converts ranks to T scores).
A third method is the paired-comparison system (Lawshe, Kephart, & McCormick, 1949). The rater is given a deck of 3 × 5 cards, each card bearing the names of two individuals to be rated. The rater checks the one name who performs the job (or whatever the dimension being rated is) better. All possible pairs of names are presented. The number of pairs can be computed by the formula

N(N − 1)/2

where N is the number of persons. Thus, if there were 12 persons being rated, there would be 66 pairings. Clearly the usefulness of this technique is limited by the size of N, although there are techniques available for larger groups (see Albright, Glennon, & Smith, 1963; Guion, 1965a).
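A quick check of the pairing formula, using Python's standard library:

```python
from itertools import combinations

people = [f"Person {i}" for i in range(1, 13)]  # N = 12 ratees
pairs = list(combinations(people, 2))           # one card per pair
print(len(pairs))                               # 66 = 12 * 11 / 2
```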
Despite their ubiquity, ratings have a number of potential limitations. Some of the limitations stem from the situation in which ratings are used. Often, for example, the supervisors have not had the opportunity to carefully observe the required behavior. Or there may be “demand” characteristics in the situation – for example, a supervisor who needs to discuss the ratings with each person rated may not be willing to give extremely low ratings. Rating forms may also be at fault in not providing clear and unambiguous behaviors to be rated.

Behavioral Anchors

One strategy to reduce bias is to use rating scales that have concrete behavioral anchors that illustrate the performance aspects to be rated. Such behavioral anchors provide a standard framework by which to evaluate a person’s performance and provide examples of specific behaviors that might be expected from good, average, or poor performers. On the other hand, such behavioral anchors may bias the ratings given by making the described behavior more salient for the rater (K. R. Murphy & Constans, 1987). An example of a rating scale item with behavioral anchors can be found in Figure 14.1.
P. C. Smith and Kendall (1963) proposed the use of continuous graphic rating scales, arranged vertically, with behavioral descriptions as anchors. The person making the ratings can place a check at any position on the line and may also indicate some actual observed behaviors to support the check made. The approach used by these investigators was not to trick the rater by using forced-choice items, but to help the rater to make accurate ratings by using anchors that reflect observable behaviors rather than inferences.

FIGURE 14–1. A rating scale with behavioral anchors. Variable to be rated: list making. The scale runs vertically from 9 (High) to 1 (Low), with behavioral anchors along the line: “I am constantly making lists” (9), “I often make lists of the things I need to do” (7–8), “Occasionally I make lists of the things I need to do” (5, Average), “I rarely make lists of the things I need to do” (2–3), and “I never make lists” (1).

Cultural Differences in Ratings

The literature indicates that self-ratings are typically one half SD higher than supervisory ratings. Subordinates tend to be more lenient in evaluating themselves than do their supervisors (M. M. Harris & Schaubroeck, 1988). Farh, Dobbins, and Cheng (1991) studied self-ratings vs. supervisory ratings in the Republic of China (Taiwan). Their results indicated a “modesty” bias – that is, Chinese employees rated their job performance less favorably than did their supervisors. Quite clearly, Western individualism stresses individual achievement, self-sufficiency, and self-respect, while in collectivistic cultures like Taiwan, individuals are discouraged from boasting about their individual accomplishments and are expected to be more modest in their self-ratings.

THE ROLE OF PERSONALITY

Personality as Predictor of Job Performance

It is often stated that individual personality variables are relatively poor predictors of job performance. B. Schneider (1983) proposed an attraction-selection-attrition framework for understanding organizational behavior. That is, individuals select themselves into and out of organizations, and different types of people make up different types of organizations. Thus personality variables should be important in determining not only which persons are seen as leaders, but also the likelihood that a person will fit in and stay with a particular company. (For a test of Schneider’s hypothesis, see Bretz, Ash, & Dreher, 1989.)
In the 1960s, a number of reviewers indicated that the validity of personality measures for personnel-selection purposes, and specifically as predictors of job performance, was low, although most of the studies on which this conclusion was based were one-shot studies lacking a theoretical or conceptual framework (Guion & Gottier, 1965). In fact, however, personality measures are widely used in employee selection, and the general conclusion about their lack of validity has changed. Which personality characteristics are important for on-the-job performance depends upon the specific job, among other things – we would not expect accountants, for example, and circus performers to exhibit the same pattern of relationships between on-the-job performance and personality aspects.
Barrick and Mount (1991) analyzed the literature with regard to three job-performance criteria – job proficiency, training proficiency, and personnel data – for five occupational groups that included police, managers, and sales persons. They analyzed the personality measures in terms of the “Big Five” model discussed in Chapter 4, i.e., extraversion, emotional stability, agreeableness, conscientiousness, and openness
to experience. The results indicated that conscientiousness showed consistent relations with all job-performance criteria for all occupational groups. Extraversion was a valid predictor for managers and sales, the two occupational fields that involve social interaction. And both openness to experience and extraversion were valid predictors of the training-proficiency criterion across occupations.
Tett, Jackson, and Rothstein (1991) also conducted a meta-analysis of the literature. They found that personality variables did correlate significantly with job criteria, although the “typical” value of .24 was somewhat low. In addition, they found higher values as a function of several variables – for example, where job analyses were used to select predictors, in studies of applicants vs. incumbents (i.e., people holding the job), and in military vs. civilian samples. These authors concluded that current validational practices have a number of correctable weaknesses; specifically, they pointed to the need to carry out job analyses and use valid personality measures. In general, most correlations between well-designed personality inventories and job-effectiveness criteria fall in the .40 to .60 range.

The Big Five revisited. As discussed in Chapter 4, the five-factor model of personality has received rather wide, but not unanimous, support, and a number of investigators have urged its use in the area of personnel selection (e.g., Barrick & Mount, 1991; Cortina, Dortina, Schmitt, et al., 1992). Some researchers have noted that the five-factor model is too broad to have predictive usefulness (e.g., Briggs, 1992; McAdams, 1992). One study, for example (Mershon & Gorsuch, 1988), found that 16 factors were better predictors of various occupational and psychiatric criteria than six scales that assessed the five-factor model. Another study (Hough, 1992) showed that a nine-factor model had higher validities than the five-factor model in predicting job performance. Schmit and Ryan (1993) administered a 60-item version of the NEO-PI (see Chapter 4) to a sample of college students and a sample of job applicants. A confirmatory factor analysis indicated that the five-factor model fit the student data well, but not the job-applicant data. In the data obtained from the job applicants, the largest factor was one made up of items from four of the five subscales – a dimension the authors labeled an “ideal employee” factor. They suggested that this dimension represents the hypothesis that respondents present themselves in a manner they see as appropriate to the situation, i.e., as job applicants, without deliberately falsifying their answers.
Hogan (1990) cites a meta-analysis done to determine the relationship between the five personality dimensions and various organizational outcomes. Among the conclusions of this study are the following:
1. The dimension of extraversion or social ascendancy is correlated with performance in sales and management.
2. The dimension of emotional stability (also called adjustment and self-esteem) is correlated with upward mobility and leadership status.
3. The dimension of agreeableness or likability is correlated with supervisors’ ratings of job performance.
4. Conscientiousness or dependability is correlated positively with academic performance and negatively with indices of delinquency.
5. Finally, openness to experience or new ideas is correlated with intellectual performance and rated creativity.

Repeated administrations. The use of personality tests as screening devices, especially in high-risk or sensitive occupations such as police officer jobs, seems to be increasing. Without a doubt, the MMPI has dominated the field. Given that the MMPI is so frequently given, it is not unlikely that a police officer, for example, may take the test several times. What is the effect of such repeated administrations on the obtained test scores? Note that this question is closely related to, but different from, the question of test-retest reliability. As discussed in Chapter 7, the long-term test-retest reliability of the MMPI, where the intervening period ranges from several months to several years, is significantly lower than its short-term reliability. Graham (1987) indicates that for normal samples, typical MMPI test reliabilities for retest periods of a day or less are about .80 to .85, for periods of 1 to 2 weeks they are .70 to .80, and for periods of a year or more they are .35 to .45.
There are several possible explanations for the increased change in scores – possible changes in personality, different interpretations placed on the meaning of items over a longer period of time, or the results of practice effects.
P. L. Kelley, Jacobs, and Farr (1994) looked at the MMPI profiles of almost 2,000 workers in the nuclear-power industry. Each of the participants had completed the MMPI several times as a result of regular employment procedures, most from two to five times. The results indicated that the obtained differences from one testing point to another were small, but for some MMPI scales there were significant changes. The authors found that scale reliabilities tended to increase, and the result was more “normal” test profiles, perhaps because candidates became more “testwise.”

Values. As our world shrinks, business individuals find themselves more frequently dealing with persons from different cultures. Problems can arise because what is highly valued in one country may not be valued in another, yet individuals tend to behave in accord with their own values, especially when making difficult or complex decisions.
Bond and a group of international colleagues (Chinese Culture Connection, 1987) developed an instrument, the Chinese Value Survey (CVS), to assess values of importance in eastern countries such as China. The CVS consists of 40 items such as “filial piety,” “solidarity with others,” “patriotism,” and “respect for tradition.” Each item is answered on a 9-point Likert-type scale that ranges from “extreme importance” to “no importance.” The CVS assesses four factor-analytic dimensions labeled as: (1) integration or social stability – i.e., being in harmony with oneself, one’s family, and colleagues; (2) Confucian work dynamism – reflecting the teachings of Confucius, such as maintaining the status quo and personal virtue; (3) human heartedness – having compassion and being people-oriented; and (4) moral discipline – i.e., self-control.
In one study (Ralston, Gustafson, Elsass, et al., 1992), the CVS was administered to managers from the United States, Hong Kong, and the People’s Republic of China (PRC). The results indicated that on the Integration dimension there were no differences between U.S. and Hong Kong managers, but both groups scored significantly higher than the PRC managers. On Confucian Work Dynamism, PRC managers scored the highest, next the Hong Kong managers, and lowest of all the U.S. managers. On the Human Heartedness dimension, where higher scores reflect a greater task orientation, U.S. managers scored highest, followed by Hong Kong managers, and lowest of all the PRC managers. Only on Moral Discipline were there no significant differences among the three groups.

BIOGRAPHICAL DATA (BIODATA)

There is a saying in psychology that the best predictor of future behavior is past behavior. This does not necessarily mean that people will behave in the future as they have in the past, but rather that earlier behavior and experience will make some future behaviors more likely (Mumford & Owens, 1987). Past behavior can be assessed through biographical data, or what is called biodata. Biodata forms consist of a variety of items, some of which are quite factual and verifiable (e.g., what was your undergraduate GPA?), and others that are more subjective and less verifiable (e.g., what is your major strength?). Quite often, such biodata forms are administered as job-application blanks that contain both descriptive information (e.g., the person’s address and phone number) and biodata items. The items or questions are then weighted or scored in some manner, and the resulting score is used for classification or predictive purposes. The literature uses a number of labels for biodata, such as background data, scored autobiographical data, or job-application blanks.
Biodata forms have been used quite successfully in a variety of areas ranging from creativity to executive evaluations. In the early 1920s, there were a number of studies in the literature showing that such weighted application blanks could be useful in distinguishing successful from unsuccessful employees in a variety of occupations. During and after the Second World War, such items were formulated in a multiple-choice format that was quite convenient and efficient. In general, these measures show relatively high validity, little or no adverse impact, and job applicants typically respond positively. One major source of concern is the degree to which they are susceptible to faking (Kluger, Reilly, & Russell, 1991).
Many biodata forms are in-house forms developed by business or consulting companies for use with specific client companies and are not available for public browsing. Others are made available to professionals. For example, the Candidate Profile Record (CPR) is a 145-item biodata form used to identify high-potential candidates for secretarial and office personnel. All items are multiple-choice and cover aspects such as academic history, work history, work-related attitudes, and self-esteem. The CPR is untimed and can be administered individually or in groups, with an administration time of about 30 to 45 minutes. The CPR can be computer scored (Pederson, 1990). For those who prefer to build their own, there is a vast body of literature on biodata forms, including a dictionary of biodata items (Glennon, Albright, & Owens, 1966).

Items. Items for biodata forms typically reflect two approaches: (1) items that center on a particular domain or criterion – for example, items that have to do with job-related knowledge and skills; and (2) items that reflect previous life experiences, such as prior job experiences or attitudes toward particular work aspects. An example of the first type of item might be: How many real-estate courses have you completed? The response options might go from "none" to "five or more." An example of the second type might be: At what age did you begin working? The response options might go from "14 and younger" to "never."

The nature of biodata items. It is not quite clear how biodata items differ from items found on other tests, such as personality tests, although one difference is that biodata items do not represent general descriptions of behavioral tendencies (e.g., I am a friendly person), but rather focus on prior behavior and experiences in specified, real-life situations. Mael (1991) suggests that biodata items pertain to historical events that may have shaped a person's behavior and identity. Furthermore, he argues that biodata items should reflect external events, be limited to objective and first-hand recollections, be potentially verifiable, and measure discrete events.

Constructing a biodata form. In general, the construction of a biodata form follows the same general steps discussed in Chapter 2. An item pool is first put together, and potential items are reviewed both rationally and empirically. Rational review might involve judges or subject-matter experts. Empirical review might involve administering the pool of items to samples of subjects and carrying out various statistical analyses. For example, items that lack sufficient variability or show skewed response distributions might be eliminated. Intercorrelations and factor analyses can be carried out to determine whether clusters of items that conceptually ought to go together indeed do so statistically.

Scaling Procedures

How is a set of biodata items turned into a scale? There are basically four procedures: (1) empirical keying, (2) rational scaling, (3) factor analysis, and (4) subgrouping.

1. Empirical keying. This represents the most common method and is essentially identical to the procedure used for personality tests such as the MMPI and CPI, discussed in Chapters 4 and 7. The pool of potential items is administered to a group or groups of subjects, and the scoring is developed against some criterion. For example, we might administer the pool of items to a large sample of job applicants, hire them all, and at the end of a year identify the successful vs. the unsuccessful through supervisors' ratings, records of sales, or some other criteria. Items in the biodata protocols are then analyzed statistically to see which ones show a differential pattern of responding, either in the two contrasted groups (i.e., successful vs. unsuccessful) or against the continuous criterion of amount of sales. Items that correlate significantly with the criterion can then be cross-validated, and those that survive form the scoring key.

One of the crucial aspects of this procedure is the criterion – its reliability, validity, and utility. To the extent that the criterion measures are well defined, measured accurately, and free of any confounding aspects, to that degree will the empirical key work (Klein & Owens, 1965; Thayer, 1977).

In most tests of personality, for example, where the key is constructed empirically, items are given unit weights – thus an item responded
to in the keyed direction is assigned a scoring weight of +1. Might we not enhance the usefulness or validity of our test if we used different weights for different items (or even for different response options to the same item), perhaps to reflect each item's ability to discriminate the criterion group members (e.g., the successful employees) from the reference group members (e.g., the job applicants)? In fact, this is often done, and the result is called a weighted biodata questionnaire (see Mumford & Owens, 1987, for a brief discussion of various procedures).

Mumford and Owens (1987) reviewed several studies of the reliability and validity of the empirical keying approach and concluded that empirically keyed biodata measures are among the best available predictors of training and job-performance criteria. In addition, empirically developed biodata forms show considerable cross-cultural validity and a relative lack of ethnic differences.

A major criticism of the empirical keying approach, aside from its dependence on the criterion, is that the scoring pattern may lack psychological meaningfulness (construct validity). Thus we may find that an item dealing with infrequent church attendance discriminates between successful and less successful employees, and we may be at a loss to explain theoretically why this item works. Those who focus on the use of tests as a means of understanding human behavior see this as a major limitation; those who see tests as a way of predicting behavior are less bothered.

2. Rational scales. Mumford and Owens (1987) identify two major strategies here, which they label the direct and the indirect approaches. In the direct approach, a job analysis is undertaken to define the behaviors that are related to differential performance. Items are then developed to presumably measure such behaviors as they appeared in the respondent's earlier life. For example, let's assume that part of being a successful life-insurance salesperson is that the individual occupies a highly visible role in the activities of the community. We might develop biodata items that would address the respondent's earlier role in his or her community, such as participation in volunteering activities, leadership roles, participation in one's religious faith, and so on. The indirect approach also attempts to identify the information that describes performance on the criterion (e.g., sales volume), but an attempt is made to identify the psychological constructs that might underlie such performance (e.g., sales motivation, feelings of inferiority, sociability), and items are developed to reflect such constructs.

Usually, items for rational scales are retained based on their internal consistency – i.e., does the item correlate with other items and/or the total score? A particular biodata form may have several such clusters of items. In general, a number of studies seem to suggest that such rationally developed scales have adequate reliability and substantial predictive validity, although studies comparing rational vs. empirical scaling techniques show the empirical scales to have higher initial criterion-related validity and similar results when cross-validated (Hornick, James, & Jones, 1977; see below). This approach has not received much attention, although it would seem to be most useful when content and construct validity are of concern.

3. Factorial scales. Here the pool of biodata items is administered to a sample of respondents, and various factor-analytic techniques are used to determine the smallest number of dimensions (i.e., factors) or groupings of items that are also psychologically meaningful. Items that load (i.e., correlate) at least .30 on a particular dimension are usually retained and scored either with unitary weights (e.g., +1) or with weights that reflect the size of the loading.

Given that most item pools are quite heterogeneous and that specific items do not correlate highly with each other, this approach is limited, because its usefulness depends upon homogeneity of item content and high internal consistency. On the other hand, the results of a limited number of studies do suggest that factor-analytic results yield fairly stable factor structures across time, across various groups, and even across cultures. Mumford and Owens (1987) point to the paucity of studies on this approach, and suggest that factorial scales may not be as effective as empirically keyed scales, but may display greater stability and generality.
4. Subgrouping. This is a somewhat complicated technique that also begins with a pool of items administered to a large sample of subjects. Item responses are then analyzed separately for men and for women, and a special type of factor analysis is carried out to determine the basic components or dimensions. For each person, a profile of scores is then generated on the identified dimensions. These profiles are analyzed statistically to determine subgroups of individuals with highly similar profiles, and these subgroups are in turn analyzed to determine how they differ on the various biodata item responses. Much of this approach has been developed by Owens and his colleagues, and most of the studies have been carried out with college students, although the results seem quite promising (Mumford & Owens, 1987).

Scoring. The scoring of empirical biodata forms is relatively simple and was first outlined by Goldsmith in 1922. This is the empirical keying method discussed in Chapter 4 concerning personality inventories. A sample of individuals is administered the pool of items, in this case biodata items. The sample is then separated into two or more subgroups on the basis of some criterion we wish to predict – for example, life-insurance salespersons are divided into those who sell X amount of insurance vs. those who sell less. The responses of the two subgroups to the biodata items are then compared, and a scoring key is formed on the basis of items whose responses distinguish the two groups. For example, suppose we find that 86% of our super salespersons indicate a large eastern city as their birthplace, whereas only 12% of our "poor" salespersons do so. That item would subsequently be scored in a positive direction to predict sales performance. Obviously, the procedure would need to be cross-validated to eliminate what might be chance results.

Such biodata scales generally work quite well and are relatively easy to develop. As with personality inventories like the MMPI and CPI, the pool of biodata items represents an open system – the same pool of items can be used to develop different scales. At the same time, biodata forms draw the same criticism that some psychologists aim at any empirically derived measure: they lack a unifying theoretical background that permits understanding of why the items work as they do. Such empirical item-keying is based on the assumption that there is a linear relationship between the item option and the criterion. Thus, with a 5-point Likert-type response, the various options will be given weights of 1, 2, 3, 4, and 5.

A second scoring alternative is called option-keying, where each response alternative is analyzed separately and is scored only if it correlates significantly with the criterion. For a Likert-type item (or any item with several response options), we might use contrasted groups and analyze the frequency with which each response option was chosen within each group. Options that statistically differentiate the two groups would then be scored plus or minus, and options that do not differentiate the two groups are scored zero. The advantage of the option-keying approach is that it can reflect both linear and nonlinear relationships and may be more resistant to faking (Kluger, Reilly, & Russell, 1991).
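A minimal option-keying sketch (in Python) may make the procedure concrete; the contrasted-group counts, the two-proportion z test, and the 1.96 cutoff are illustrative assumptions rather than a published scoring standard:

    # Illustrative option-keying for one biodata item. Each response option
    # is scored +1, 0, or -1 depending on whether its endorsement rate
    # differs reliably between contrasted groups (e.g., successful vs.
    # unsuccessful employees). The z test and cutoff are illustrative.
    from math import sqrt

    def option_weights(counts_success, counts_fail, n_success, n_fail, z_crit=1.96):
        weights = {}
        for option in counts_success:
            p1 = counts_success[option] / n_success
            p2 = counts_fail[option] / n_fail
            p = (counts_success[option] + counts_fail[option]) / (n_success + n_fail)
            se = sqrt(p * (1 - p) * (1 / n_success + 1 / n_fail))
            z = (p1 - p2) / se if se > 0 else 0.0
            if z > z_crit:
                weights[option] = 1    # chosen more often by the successful group
            elif z < -z_crit:
                weights[option] = -1   # chosen more often by the unsuccessful group
            else:
                weights[option] = 0    # option does not differentiate the groups
        return weights

    # Hypothetical data: how often each option was chosen in each group.
    success = {"a": 86, "b": 30, "c": 14, "d": 20}   # n = 150
    fail    = {"a": 18, "b": 32, "c": 60, "d": 40}   # n = 150
    print(option_weights(success, fail, 150, 150))
    # {'a': 1, 'b': 0, 'c': -1, 'd': -1}

Because each option is keyed on its own, an option in the middle of a Likert continuum can carry a nonzero weight while its neighbors do not – which is how the method captures nonlinear relationships.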
Empirical vs. rational. Biodata questionnaires can be classified as either empirical or rational (Mosel, 1952). Empirical biodata questionnaires are developed using criterion validity. Items are typically multiple choice, and the scoring weights are based on the empirical relationship between the item and the criterion. Rational biodata questionnaires are developed using content validity. Items usually require narrative responses that typically focus on previous job experiences, and the responses are evaluated by raters with the help of predetermined standards.

A basic question to ask is: which method is better? T. W. Mitchell and Klimoski (1982) developed a biodata form for real-estate salespersons and administered it to a sample of more than 600 enrollees in a course on real-estate principles. Two scoring keys were then developed, one based on an empirical approach and the other on a rational approach, against the criterion of whether the individual had or had not obtained a license to sell real estate.

The empirical approach involved comparing the response frequencies of the two groups and translating these frequencies into scoring weights. T. W. Mitchell and Klimoski (1982) illustrate this procedure with one item that inquires about the person's living arrangements:

    Do you:                 Licensed   Not Licensed   Difference   Weight
    own your home?             81%          60%            21          5
    rent a house?               3%           5%            –2         –1
    rent an apartment?          9%          25%           –16         –4
    live with relatives?        5%          10%            –5         –2

The first two columns indicate the percentage of responses by those who obtained the license and those who did not. The third column is simply the difference. These differences are then translated into scoring weights (shown in column four) using standard tables that have been developed for this purpose. Thus an individual who indicates that he or she owns their home would get +5 points, whereas a person who rents an apartment would receive –4 points.

In the rational approach, the intent is to identify items that measure a theoretically meaningful set of constructs. Item-response choices are then scored according to a hypothesized theoretical continuum. For example, in the item given above, it might be theoretically assumed that as a person becomes more economically independent, they move from living with relatives, to renting an apartment, to renting a house, and finally to owning a home. In this approach, these four response choices would be given scoring weights of 1, 2, 3, and 4, respectively.

For each person in the sample and in a cross-validation sample, the two scores were correlated against the criterion of licensing, with the following results:

                       Original sample   Cross-validation sample
    Empirical score         .59                   .46
    Rational score          .35                   .36

Note that the empirical approach was superior to the rational even though the empirical approach showed shrinkage (i.e., loss of validity) upon cross-validation, whereas the rational approach did not.
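Both keys are straightforward to apply once tabled; a minimal sketch in Python (the scoring function and the one-item applicant record are illustrative):

    # The empirical key uses the derived weights from the table above; the
    # rational key assumes the economic-independence continuum
    # (live with relatives < rent apartment < rent house < own home).
    empirical_key = {"own home": 5, "rent house": -1,
                     "rent apartment": -4, "live with relatives": -2}
    rational_key  = {"live with relatives": 1, "rent apartment": 2,
                     "rent house": 3, "own home": 4}

    def score(responses, key):
        # Sum the keyed weights over all items answered; here, one item.
        return sum(key[r] for r in responses)

    applicant = ["own home"]
    print(score(applicant, empirical_key), score(applicant, rational_key))  # 5 4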
Accuracy. Biodata items rely on the self-report of the respondent as to their past behavior and experiences. There is concern that item responses may be distorted by selective recall, either purposive or "unconscious." A number of studies have looked at this issue with mixed results; some studies have found substantial agreement between what the respondent reported and objective verification of that report, whereas other studies have found disagreement (e.g., Cascio, 1975; I. L. Goldstein, 1971; Weiss & Dawis, 1960).

Mumford and Owens (1987) believe that the discrepancy among studies may reflect the methodology used, and that although there may be a general self-presentation bias, item responses are accurate when there is no motive for faking. They also indicate that there are three techniques that can be used to minimize faking: (1) use items that are less sensitive to faking; (2) develop scoring keys that predict faking; and (3) use faking keys in the scoring (see Chapter 16).

Reliability. The reliability of biodata items can be enhanced by a number of approaches:

1. Make the items simple and brief.
2. Give the response options on a numerical continuum [e.g., What was your salary range in your last position? (a) less than $10,000; (b) $10,000 to $15,000; etc.].
3. Provide an escape response option if all the possible alternatives have not been included (e.g., in the above item we might include the option "I did not receive a salary").
4. Make the items positive or neutral (for example, instead of saying "I was fired from my last job because . . . ," we might ask, "What was the length of service in your last job?") (Owens, Glennon, & Albright, 1962).

A number of studies have indicated that well-developed biodata questionnaires show relatively low item intercorrelations – i.e., the items are relatively independent (e.g., Owens, 1976; Plag & Goffman, 1967). We would therefore expect relatively low internal-consistency coefficients, and that indeed is the case, with typical coefficients in the range of .40 to .80. Test-retest reliability, on the other hand, seems more than adequate, even with rather substantial time intervals of several years.

Validity. Because biodata forms differ from each other and are compared with a variety of criteria, the issue of validity can be quite complex.
In general, however, the literature indicates that biodata forms are valid in a wide variety of settings and samples (see Mumford & Owens, 1987). Reilly and Chao (1982) reviewed a number of studies and found the average correlation between biodata forms and on-the-job productivity to be .46.

One might expect the validity of biodata forms to be somewhat limited over a long period of time. For example, a form developed in the 1950s to select successful candidates for a specific position might not work very well in the 1990s. Such indeed seems to be the case, although few longitudinal studies exist. A notable exception is a study by A. L. Brown (1978), who investigated the long-term validity of a biodata scoring key developed in 1933 for a sample of life-insurance agents; the results some 38 years later indicated that little, if any, validity was lost.

Two other findings need to be mentioned: (1) significant black-white differences have not been obtained in most studies (Mumford & Owens, 1987); and (2) typically, correlations between information provided by an applicant and the same information obtained from previous employers are very high, with coefficients in the .90s (e.g., Mosel & Cozan, 1952).

Biodata and creativity. Biodata has been particularly useful in studies of the identification and prediction of scientific competence and creativity (e.g., Albright & Glennon, 1961; Kulberg & Owens, 1960; McDermid, 1965; Schaefer & Anastasi, 1968; M. F. Tucker, Cline, & Schmitt, 1967).

Tucker, Cline, and Schmitt (1967), for example, administered a 160-item biographical inventory, originally developed with NASA scientists, to a sample of 157 scientists working for a pharmaceutical company. Information on a variety of criteria was also collected; these included supervisory and peer ratings on creativity and overall work performance, as well as employment records such as length of employment with the company and number of salary increases. High intercorrelations among the criteria were interpreted by the authors as reflecting the influence of a halo effect. Peer ratings and supervisory ratings on the same dimensions did not correlate very highly. Subsets of items on the biodata inventory, however, when cross-validated, correlated significantly (.36 and .42) with criteria of creativity.

Biodata with unskilled workers. Scott and R. W. Johnson (1967) reported an interesting study of unskilled workers in a small canning factory. The company experienced substantial losses from unfilled contracts due to rapid employee turnover – therefore "tenure" was of great importance and tantamount to worker effectiveness. They first identified 75 workers who had been at the factory for at least 6 months (long-tenure group) and 75 who had been there for 1 month or less (short-tenure group). Apparently, all of these workers had filled out a job-application form that contained 19 items. From each of the two samples, 50 applications were selected randomly, and item responses were analyzed. Differential weights of 0, 1, or 2 were assigned to items as a function of their ability to discriminate the two samples. Of the 19 items, 12 were given different weights; a cross-validation with the 25 employment forms in each group not used in the initial analyses indicated a hit rate of 72% and a correlation coefficient between scores on the biodata form and tenure of .45.

A regression equation using just six items was then developed:

    Tenure = .30(age) + 8.82(gender) - .69(miles from plant)
             + 5.29(type of residence) + 2.66(number of children)
             + 1.08(years on last job) - 1.99

Gender was scored as male = 0 and female = 1, and type of residence was scored as 0 if the person lived with parents or in a rented room and 1 if they lived in their own home. Using this regression equation in the cross-validation groups yielded a hit rate of 70% and a correlation coefficient of .31. Both the regression equation and a subsequent factor analysis point to two major themes: family responsibility and convenience. Long-term employees were married, provided for one or more dependents, lived in their own home, and had worked a relatively long period of time in their last job. Unskilled females who lived fairly close to work were likely to stay on the job at this factory longer.
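The equation can be applied directly to a new application blank; a sketch in Python (the applicant's values are invented for illustration):

    # Predicted tenure from the Scott and Johnson (1967) regression equation.
    # Gender: male = 0, female = 1. Type of residence: 0 = lives with parents
    # or in a rented room, 1 = lives in own home.
    def predicted_tenure(age, gender, miles_from_plant, type_of_residence,
                         number_of_children, years_on_last_job):
        return (0.30 * age + 8.82 * gender - 0.69 * miles_from_plant
                + 5.29 * type_of_residence + 2.66 * number_of_children
                + 1.08 * years_on_last_job - 1.99)

    # A hypothetical applicant: 34-year-old woman, 3 miles from the plant,
    # owns her home, two children, 4 years on her last job.
    print(predicted_tenure(34, 1, 3, 1, 2, 4))  # 29.89

Note how the signs echo the two themes above: being female, living in one's own home, and having children raise the predicted tenure, while distance from the plant lowers it.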
(For a biodata study on college students, see Holland & Nichols, 1964.)

ASSESSMENT CENTERS

This approach was originally developed to select military officers during World War II, but was made popular in the 1950s by the American Telephone and Telegraph Company; it refers to a method of assessment rather than an actual physical place (Thornton & Byham, 1982). The first step in this method involves a job assessment, that is, an analysis of what qualities, skills, dimensions, etc., are relevant to a specific job. Then job candidates are observed and assessed for a given period of time, which may last several days. Multiple assessment or simulation methods are used, including leaderless group discussions, in-basket tests (see below), and problem-solving exercises, as well as standard tests believed to be useful in inferring managerial skills and abilities (Brannick, Michaels, & Baker, 1989). There are usually multiple assessors whose judgments, typically in the form of ratings, are pooled to produce an overall evaluation. Because of the time and expense involved, assessment centers are usually used for managerial and executive personnel, for either selection or training purposes. (For an overview of what assessment centers are all about, see Bray, 1982.)

Uniqueness of dimensions. The dimensions and tasks used are not standard and vary from one assessment center to another. Reilly, Henry, and Smither (1990), for example, report eight dimensions that include leadership, problem solving, work orientation, and teamwork. Two of their group exercises can be briefly described. In one, the candidates had to arrive at an assembly procedure for a flashlight or a simple electrical device. In another, the group was required to organize and plan an approach to constructing a prototype of a robot, after being shown a model and given tools and parts. Three assessors were each assigned two candidates to observe and rate. T. H. Shore, Thornton, and L. M. Shore (1990) used 11 dimensions falling into two major domains: the interpersonal-style domain, which included "amount of participation" and "understanding of people," and the performance-style domain, which included "originality," "work drive (i.e., persistence)," and "thoroughness of performance." Some tasks seem to be used more frequently, such as leaderless group discussions and the in-basket test.

Validity. Meta-analytic reviews support the validity of assessment centers from a predictive-validity perspective, but less so from a construct-validity point of view. In other words, less is known about why assessment ratings predict later performance than about the fact that they do (Gaugler, Rosenthal, Thornton, et al., 1987; Klimoski & Brickner, 1987; N. Schmitt, Gooding, Noe, et al., 1984).

A number of researchers have reported a lack of evidence to support the convergent and discriminant validity of the specific dimensions assessed: correlations among different dimensions within the same exercise are higher than correlations for the same dimension across different exercises – that is, the ratings reflect the exercise (the method) more than the dimension (the trait) being rated (e.g., Sackett & Dreher, 1982). Others have shown that using behavior checklists increases the average convergent validity and decreases the average discriminant validity (Reilly, Henry, & Smither, 1990). Sackett (1987) points out that the content validity leaves much to be desired – it is not enough to pay attention to the construction of the exercises; how these exercises are presented and evaluated is also crucial.

Why do they work? There is substantial evidence that assessment centers are useful predictors of subsequent managerial success. The main issue is not so much whether they work, but why they work (i.e., predictive vs. construct validity). Typical conclusions are that assessment centers are useful tools to predict the future success of potential managers, regardless of educational level, race, gender, or prior assessment-center experience (Klimoski & Brickner, 1987). They seem to work in a wide variety of organizational settings, ranging from manufacturing companies to educational and governmental institutions, and they can be useful not only for selection and promotion purposes, but also for training, for career planning, and for improving managerial skills.

The traditional answer is that assessment centers are standardized devices that allow the assessment of traits, which are then used to predict future success on the job. They work because they do a good job of measuring and
integrating information regarding a person's traits and qualities. However, the evidence suggests that assessment-center ratings do not reflect the dimensions they are supposed to and may at best represent a global rating, something like a "halo" effect. Other alternative explanations range from the hypothesis that promotions in organizations may be partially based on ratings of assessment-center performance (i.e., those who do well get promoted) to the hypothesis that the ratings obtained reflect the level of intellectual functioning of the candidates.

The In-Basket Technique

Probably the best-known situational or simulation exercise used in assessment centers is the in-basket technique (Frederiksen, 1962). In this technique the candidate is given an "in-basket" that contains letters, memos, records of phone calls, etc., and the candidate is asked to handle these as he or she would on an everyday job. The behavior (i.e., the actions and decisions the candidate makes) is scored according to content and to style: content refers to what was done, and style to how a task was completed. Content-scoring procedures typically involve a simple counting – for example, the number of memos completed or the number of decisions made. Stylistic scoring procedures evaluate the quality of the performance – the quality of the decisions made. Unfortunately, the technique is not standardized, and a variety of assessment materials and scoring procedures are used.

Reliability. Three types of reliability have been investigated: interrater reliability, alternate-form reliability, and split-half reliability. Interrater reliability coefficients vary substantially: in one study, for example, they varied from .47 to .94, and in another study from .49 to .95. Median rs, however, appear to be fairly substantial: in one study the median r was .80, and in another it was .91. Schippmann, Prien, and Katz (1990) concluded that, despite the fact that none of the studies they reviewed used more than 4 raters and sample sizes were typically quite small, scorers and raters are responding fairly consistently to the data.

Few studies have looked at alternate-form reliability, but the few that have report fairly abysmal results. Typical correlations are only in the high .10s and low .20s, suggesting either that individuals are not consistent in their performance across forms or that different versions of the in-basket elicit different behaviors. Split-half reliabilities have also been disappointing, with typical reliability coefficients in the .40s to .50s range.

Validity. Schippmann, Prien, and Katz (1990) argue that the available literature confuses face validity with content validity. Thus, although a number of studies suggest that the in-basket procedure is valid because of its contents, none uses content-validity procedures to construct an in-basket form.

Criterion validity is a bit more complex to summarize. There are various criteria that can be used, such as supervisory ratings, ratings obtained as part of the assessment-center evaluation, or various indices of career progress such as occupational title and salary. A very wide range of correlation coefficients is reported in the literature, ranging from nonsignificant to coefficients in the .50 to .70 range. Clearly, some of the evidence supports the validity of the procedure, but just as clearly there is a need to study what aspects are confounding the results.

Very few studies have looked at the construct validity of this procedure. Studies that use factor analysis yield a fair amount of congruent results, but the evidence is too sparse to come to any conclusions. In general, Schippmann, Prien, and Katz (1990) conclude that the use of the in-basket technique is based more on "belief" about its value than on empirical evidence.

An experimental study. Brannick, Michaels, and Baker (1989) used two alternate forms of the in-basket exercise. Each in-basket presented a series of tasks, such as phone messages, complaints from managers, and schedule conflicts, that the candidate needed to act on. The responses were scored on five dimensions: organizing and planning, perceptiveness, delegation, leadership, and decision making. Scoring keys for these dimensions were developed by collecting the responses of about 20 students and managers and rating each response as positive, neutral, or negative on each of the five dimensions. A prior study indicated that the interrater correlation for
this scoring procedure was .95, so it appears to be reliable.

A sample of 88 business students, both graduate and undergraduate, was then administered both forms. Interjudge reliability of the scoring of the two forms ranged from a low of .71 for organizing and planning to a high of .89 for decision making, with coefficients of .91 and .94 for the total sum of Form A and of Form B. Internal-reliability coefficients ranged from a low of .35 to a high of .72, with only one of the 12 coefficients (two forms with five dimensions and one total per form) above .70. More surprising was that scores across the two forms for the same dimensions did not correlate very highly – the coefficients ranged from .21 to .43. As the authors indicate, these results call into question the validity of inferences about managerial traits that are made on the basis of in-basket scores.

ILLUSTRATIVE INDUSTRIAL CONCERNS

Work Samples

A work sample is a replica of the position for which someone has applied. For example, someone applying for a position as a typist might be asked to type a letter from dictation. Clearly, for some jobs it is relatively easy to put together a work sample, but for others it is a bit more difficult. Work samples are sometimes called situational exercises, particularly when only a specific aspect of a job is being sampled. Thus, with candidates for a management position, we might put them through situational exercises in which they need to show their ability to solve a personnel problem, their evaluation of a written document such as a contract, or their ability to write a business letter.

Expectancy Tables

Industrial psychologists are often concerned with predictive validity. In addition, they need to communicate such validity results to individuals who are not necessarily trained in statistics or psychometrics and who may have difficulty understanding the meaning and limitations of a correlation coefficient. As discussed in Chapter 3, a useful device to communicate such results is an expectancy table, essentially a graphic display of data that shows the relationship between a predictor (such as scores on a test) and a criterion (such as amount of sales), and allows the determination of the likelihood that a person with a specific score on the predictor will achieve a particular level of performance on the criterion.
Banding

When a sample of applicants is given a test, they are typically ranked on the basis of their test scores, and the top-scoring applicants are then selected. Because of some of the issues discussed in Chapter 11, such a procedure may have an adverse impact on members of specific minority groups. Ideally, a selection method should reduce adverse impact but not reduce the utility of the procedure. One strategy is to adjust test scores to reduce or remove group differences, such as by using separate norms for blacks and for whites. However, such a strategy is not only illegal, in that it is forbidden by the Civil Rights Act of 1991 (see Chapter 15), but it is also difficult to support rationally.

Another approach is that of test-score banding (Cascio, Outtz, Zedeck, et al., 1991). Banding involves specifying differences in test scores that might be observed for two individuals who do not differ on the construct measured by the test (K. R. Murphy, 1994). For example, if the width of a test-score band is 8 points, then someone who scores 28 should not be treated differently than someone who scores 22, because a difference of less than 8 points does not, in this example, reliably indicate a difference on the construct measured. Note that this is very much like the concept of the standard error of the difference between two means, discussed in Chapter 3, and indeed the formula for computing the bandwidth involves the reliability of the test and the standard error of measurement (see K. R. Murphy, 1994, for the actual formula).
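One common version of the computation, assumed here for illustration, sets the band at 1.96 standard errors of the difference between two scores:

    # Bandwidth from test reliability: SEM = SD * sqrt(1 - rxx); the standard
    # error of the difference between two scores is SEM * sqrt(2); multiplying
    # by 1.96 gives a 95% band. (One common formulation, assumed here; see
    # Murphy, 1994, for the actual formula.)
    from math import sqrt

    def bandwidth(sd, rxx, z=1.96):
        sem = sd * sqrt(1 - rxx)
        return z * sem * sqrt(2)

    print(round(bandwidth(sd=10, rxx=0.90), 1))  # 8.8: scores within about
                                                 # 9 points are treated as
                                                 # equivalent

Note that the less reliable the test, the wider the band – an unreliable test cannot justify fine distinctions among applicants.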
Basically, banding defines a range of scores that should be treated as if they were identical; the selection decision must then be made using additional information. For example, let's assume we have administered test X, where the highest obtained score is 120, and we have computed the bandwidth to be 5. Then scores between 120 and 115 are treated as essentially equal. Let's say we need to hire 8 applicants and we have a pool of 12 who scored 115 and above. In this case we would need to use additional information to select the 8 out of the 12. Suppose, however, only 5 applicants scored above 115. We would hire all of them, as well as 3 applicants from the next bandwidth, namely 110 to 115.

Banding is not without criticism. Quite obviously, the procedure requires that we ignore the numerical differences within a band but not across bands – in the example above, differences between 115 and 120 are ignored, but 114 is treated as different from 115. This, of course, is not unusual. In college classes such a procedure is used implicitly when numerical scores are translated into letter grades – scores of 93 to 100 might, for example, all be given As, while a score of 92 might be equated to a B (F. L. Schmidt, 1991).

Synthetic Validity

Lawshe introduced the concept of synthetic validity to refer to the notion that tests can be validated not against a single overall criterion but against job elements (Guion, 1965b; Lawshe, 1952; Lawshe & Steinberg, 1955). These job elements may be common to many dissimilar jobs. By using tests that are valid for specific job elements, one can create a tailor-made battery of tests, even for a new, unique job.

Guion (1965b) presents an interesting application of this method. He studied an electrical wholesaler firm composed of 48 people, from president to stock boy. In no case were more than 3 persons doing the same job. Clearly, a traditional approach in which test scores are related to a criterion would not work here. A detailed analysis of the various jobs and job descriptions indicated that there were seven major job elements, such as sales ability, creative business judgment, leadership, and work organization. The president and vice president of the company were then asked to rate the employees on the seven major job elements, as well as to give an overall rating. The procedure used attempted to eliminate any halo effects in the ratings, and only employees for whom a particular job element was relevant were rated on that dimension. The interrater reliability was quite high, ranging from a low of .82 to a high of .95, with most of the coefficients in the high .80s. Employees were subsequently administered a test battery that yielded 19 scores. These scores were then compared to the seven job-element ratings, with the result that for five of the seven job-element ratings there were specific tests that correlated significantly with each job-element dimension. An application of these results to 13 new employees showed the system to be working well, and that consideration of specific job elements was better than the use of an overall criterion. (For other examples of synthetic validity studies, see Drewes, 1961; Griffin, 1959.)
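The underlying logic can be sketched as a lookup from job elements to element-valid tests (Python; all test names, elements, and coefficients below are hypothetical, not Guion's actual data):

    # Synthetic validity: assemble a battery for a new job from tests already
    # validated against individual job elements, rather than against an
    # overall criterion. All names and coefficients are hypothetical.
    validities = {            # test -> {job element: validity coefficient}
        "arithmetic": {"work organization": .40, "sales ability": .10},
        "vocabulary": {"sales ability": .35, "leadership": .30},
        "assembly":   {"work organization": .45},
    }

    def battery_for(job_elements, min_r=.30):
        # For each element the new job requires, pick a test with an
        # acceptable element-specific validity.
        battery = {}
        for element in job_elements:
            for test, table in validities.items():
                if table.get(element, 0) >= min_r:
                    battery.setdefault(element, test)
        return battery

    print(battery_for(["sales ability", "work organization"]))
    # {'sales ability': 'vocabulary', 'work organization': 'arithmetic'}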
Time Urgency

How persons keep track of and use time has been a major topic of interest for industrial-organizational psychologists. One's time orientation seems to be related to a significant number of behaviors and outcomes, including the possibility of experiencing greater stress and subsequent cardiovascular disease.

Time urgency, or the tendency to perceive time as a scarce commodity, seems to be a particularly important variable. Like other complex psychological variables, time urgency is probably multidimensional. In one study, seven such dimensions were identified, including eating fast, impatience, and doing many things at once (Edwards, Baglioni, & Cooper, 1990).

Landy, Rastegary, Thayer, et al. (1991) took the items from four different scales, eliminated duplicate items, and eventually developed a "new" scale of 33 items, all responded to in a Likert-type format. These items seemed to fall into five factors, labeled competitiveness, eating behavior, general hurry, task-related hurry, and speech pattern. The five factors had coefficient alpha reliabilities ranging from a low of .69 to a high of .89 and intercorrelated with each other at low to moderate levels, with correlation coefficients ranging from .17 to .39.

In a second study, Landy, Rastegary, Thayer, et al. (1991) undertook to develop a new test of time urgency by using behaviorally anchored rating scales (P. C. Smith & Kendall, 1963). Using a brainstorming technique, nine dimensions of time urgency were developed, such as "awareness of time," "eating behavior," and "speech patterns." For each of these nine dimensions, specific behavior anchors were then written; these were submitted to various logical and statistical analyses, until there were two parallel
rating scales for each of six dimensions and a single rating scale for a seventh dimension (two other dimensions were dropped). The resulting 13 scales were administered to a sample of introductory psychology college students, who were retested 4 weeks later. Test-retest reliabilities ranged from a low of .63 to a high of .83, with most values in the acceptable range. Parallel-form reliabilities ranged from .60 to .83, again in the acceptable range. Intercorrelations among the seven dimensions were quite low, with the majority of coefficients in the .10s or below. The authors also present evidence for the construct validity of the scales.

Excessive Turnover

Excessive turnover is a serious problem for many companies. Whether an employee quits voluntarily, fails a probationary period, or is fired, the economic loss to the employer can be substantial. There are a number of ways to combat this problem, but one strategy is to identify such "short-tenure" personnel before hiring. One method is through the use of personal-history items, often found on standard application blanks.

Schuh (1967) reviewed the available literature and concluded that biographical items had predictive validity as related to turnover, although his conclusion was later criticized (e.g., D. P. Schwab & Oliver, 1974).

Cascio (1975) studied the application blanks of 160 female clerical employees of a large insurance company that was experiencing a 48% turnover rate within a year after hire – i.e., of every two employees hired, only one would remain longer than a year. Of the total group, 80 were minorities, primarily Spanish speaking, and 80 were not. Within each group, 40 had voluntarily terminated within a year (short tenure) and 40 had remained on the job 12 months or longer (long tenure). Sixteen items were chosen from the application blank on the basis that previous research had shown them to be valid predictors of tenure. These items were scored, and the resulting total scores were analyzed for minority and nonminority group members separately and cross-validated. Ten items survived the item analyses, including such items as age, marital status, education, tenure on previous job, and location of residence.

Differences in total scores between short-tenure and long-tenure employees were significant within each racial group, but differences across groups were not significant. Point-biserial correlations between application-blank scores and the dichotomy of long vs. short tenure were .77 for the nonminority group and .79 for the minority group. These coefficients dropped slightly, to .56 and .58, on the cross-validation samples, as expected. By using an expectancy chart of the combined data, Cascio (1975) could increase the predictive accuracy to 72%, as opposed to the base rate of 52%.
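The point-biserial r used here is simply a Pearson correlation between a continuous score and a 0/1 criterion code; a sketch with invented data:

    # Point-biserial correlation: Pearson r between a continuous score and a
    # dichotomous criterion (here 1 = long tenure, 0 = short tenure).
    from math import sqrt

    def point_biserial(scores, groups):
        n = len(scores)
        mean1 = sum(s for s, g in zip(scores, groups) if g == 1) / groups.count(1)
        mean0 = sum(s for s, g in zip(scores, groups) if g == 0) / groups.count(0)
        p = groups.count(1) / n
        mean = sum(scores) / n
        sd = sqrt(sum((s - mean) ** 2 for s in scores) / n)
        return (mean1 - mean0) / sd * sqrt(p * (1 - p))

    scores = [12, 15, 9, 14, 8, 11, 16, 7]   # application-blank scores (invented)
    tenure = [1, 1, 0, 1, 0, 0, 1, 0]        # invented 0/1 tenure codes
    print(round(point_biserial(scores, tenure), 2))  # 0.88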
Occupational Choice

How are occupational choices made? There are a number of theories that attempt an explanatory framework. Some emphasize the economic aspect – people choose jobs that give them the best economic advantage. Others emphasize chance – most occupational choices are made on the basis of serendipity and happenstance. Still others emphasize the values and goals that a person learns from society and family. Many psychological theories attempt to explain vocational choice in terms of the differences in aptitudes, interests, and personality traits among people.

Ginzberg (1951) conceptualized vocational choice as a developmental process spanning the entire period of adolescence, and felt that specific events or experiences played a major role. Other theorists have also emphasized the role of specific life experiences, and the usefulness of biodata in predicting vocational choice. A typical study is that of Neiner and Owens (1985), who administered a 118-item biographical inventory to entering college freshmen. Some 3 to 5 years after graduation, these students were contacted and asked, as part of a questionnaire, for their current job title and job description. These jobs were then coded into one of the six groupings proposed by Holland – i.e., artistic, investigative, conventional, realistic, social, and enterprising (see Chapter 6).

A statistical analysis indicated that the biodata factor dimensions did correlate significantly with post-college job choice. For example, males who entered investigative-type jobs tended to score higher on the biodata dimensions of academic achievement (e.g., higher standing on academic grades; more successful in academic
situations), social introversion (e.g., fewer dates, fewer casual friends), and scientific interest (e.g., enjoyed science courses, the use of scientific apparatus).

TESTING IN THE MILITARY

As you might imagine, there is a substantial amount of testing that takes place in the military. We cover a few examples, just to illustrate some of the basic issues (for a brief history of testing in the military, see Haney, Madaus, & Lyons, 1993).

The Navy Basic Test Battery

In 1943, during World War II, a battery of tests known as the Basic Test Battery was developed by Navy psychologists for the purpose of classifying enlisted personnel into occupational specialties; originally there were nine tests, but three were dropped and three were combined into one. The remaining four subtests included in the battery were: (1) the General Classification Test, originally developed as a test of verbal reasoning; (2) the Arithmetic Reasoning Test, containing verbally stated problems that require arithmetic solutions; (3) the Clerical Aptitude Test, containing items that require alphabetizing, name checking, and letter checking (i.e., are these two names or words the same?); and (4) a Mechanical Test, composed of aptitude and knowledge sections, designed to measure the ability to apply mechanical principles to the solution of problems and to assess knowledge of mechanical and electrical tools.

This test battery has been used to select enlisted personnel for training as apprentices in a variety of naval occupations. The primary validational criterion for these tests has been final school-grade averages for the trainees at the end of their training period; the resulting validity coefficients have varied from .03 to .72. Kuder-Richardson reliabilities ranged from .77 to .96, with a median of .87.

Merenda (1958) reported a study of two samples of candidates for advancement to petty officer. The first sample was used to derive multiple-regression equations, and the second sample to cross-validate such regression equations against the criterion of scores on a promotion examination. The obtained correlation coefficients between scores on the Basic Test Battery and scores on the promotion examination ranged from .28 to .67, with a median coefficient of .45. The obtained correlation coefficients between actual promotion-examination scores and predicted examination scores ranged from a low of .10 to a high of .82, with a median of .48. For only one Navy occupation was the correlation not statistically significant, indicating that scores on the Basic Test Battery were useful in the assignment of Navy personnel.

The Army Selection and Classification Project (Project A)

Between 1983 and 1988, Project A was developed to create a selection and classification system for all 276 entry-level positions in the United States Army. The primary instrument used in this project was the Armed Services Vocational Aptitude Battery (ASVAB), made up of 10 subtests; 4 of the subtests comprise a test of their own, called the Armed Forces Qualification Test (AFQT). For each of the entry-level positions, critical scores on the appropriate subtests were developed; if an individual scored above those critical scores, that person could be assigned to the position, depending on Army needs, individual preference, and other aspects. Basically, Project A was one of the largest validational studies ever conducted; 19 of the entry-level positions, such as "motor transport operator" and "combat engineer," were studied with samples of 500 to 600 individuals in each position (see J. P. Campbell, 1990, for overall details; McHenry, Hough, Toquam, et al., 1990, for specific validity results; and Drasgow & Olson-Buchanan, 1999, for the history of a computerized version of the ASVAB).
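The classification rule just described is essentially a multiple-cutoff screen; a sketch (Python) with hypothetical positions, subtests, and critical scores – not the actual ASVAB standards:

    # Multiple-cutoff screening: a candidate qualifies for a position only if
    # every relevant subtest score meets that position's critical score.
    # Positions, subtests, and cutoffs below are hypothetical.
    cutoffs = {
        "motor transport operator": {"mechanical": 45, "spatial": 40},
        "combat engineer":          {"mechanical": 50, "arithmetic": 45},
    }

    def qualifies(scores, position):
        return all(scores.get(subtest, 0) >= minimum
                   for subtest, minimum in cutoffs[position].items())

    candidate = {"mechanical": 52, "spatial": 38, "arithmetic": 47}
    print([p for p in cutoffs if qualifies(candidate, p)])  # ['combat engineer']

The final assignment among the positions for which a candidate qualifies would then depend, as noted, on organizational needs and individual preference.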
The Dot Estimation Task (DOT)

In 1983, the United States Air Force Human Resources Laboratory began administering an experimental test battery to new flight-training candidates to select and classify pilots (Lambirth, Gibb, & Alcorn, 1986). One component of the battery was the Dot Estimation Task (DOT), in which the subject is simultaneously presented with two fields containing an arbitrary number of dots ranging from 1 to 50; one of the two fields contains one more dot than the other. The candidate is required to determine as rapidly as possible which of the two fields contains the greater number of dots. This is a self-paced test lasting 6 minutes and containing a maximum of 50 trials. The DOT was intended to measure compulsivity vs. decisiveness, with the underlying assumption that compulsive individuals will either actually count the dots or vacillate in their estimates, and thus have longer response times and attempt fewer trials in a given time period. Decisive individuals will presumably work quickly and make their estimates with some rapidity. Lambirth, Gibb, and Alcorn (1986) administered the DOT and a personality measure of obsessive-compulsive behavior to 153 students and retested them 4 weeks later. The reliability of the DOT was low: .64 for number of trials attempted, .46 for number of correct trials, and .64 for number of incorrect trials. All validity coefficients – that is, all correlations between DOT scores and scores on the obsessive-compulsive measure – were essentially zero, indicating that the DOT cannot be used as a nonverbal indicator of obsessive-compulsive traits.

Visual-spatial abilities of pilots. In this textbook, we have emphasized paper-and-pencil tests. Tests, however, come in many forms, and sometimes the line between what is a test and what is an experimental procedure can be quite blurred. In Chapter 1, however, we suggested that one useful way to think of a test is as a laboratory procedure, and so we can consider laboratory procedures as tests – that is, as needing to show reliability and validity.

FIGURE 14–2. Example of stimulus figure used by Dror, Kosslyn, and Waag (1993): original stimulus, rotated, and mirror image.

Dror, Kosslyn, and Waag (1993) studied the visual-spatial abilities of pilots in a series of five experiments. In one of them, a shape similar to that in Figure 14–2 was presented on a computer screen to the individual pilot. Then a second shape was presented, and the pilot had to decide as fast as possible whether the second shape represented a rotation of the first or a mirror-reversed version. Sixty-four such trials were conducted. The authors did not consider the reliability of the task directly, although they did eliminate any responses where the response time was greater than 2.5 times the mean of the other responses. They did ask whether the task was measuring what "they thought it was," and answered by looking at whether response time varied with degree of rotation (does it take longer to analyze a figure that has been rotated a lot than one that has been rotated a little?). The data indicated that the task was in fact assessing mental-rotation ability, and the authors found that pilots were faster overall than nonpilots, but equally accurate.

PREDICTION OF POLICE PERFORMANCE

A substantial number of studies have assessed law-enforcement candidates in an effort to identify those who are ill suited for such jobs. There are two major reasons why testing might be useful: the economic issue, to screen out recruits who subsequently might fail the training, the probationary period, or actual job performance; and the legal-social issue, to identify those candidates who might misuse their training and position.

The process of selecting the best qualified applicants has been, and in many instances still is, based on subjective evaluation and clinical
impressions often made by administrators rather than mental-health specialists. A 1967 Presidential advisory committee on law enforcement was critical of such techniques and recommended the use of clinical psychologists and psychiatrists, as well as batteries of appropriate assessment devices (J. J. Murphy, 1972).

Some problems. Once again the criterion problem looms large. Police departments vary substantially, and the demands made on an urban police officer may be quite different from those made on a rural county sheriff. The base-rate problem also complicates the picture. Inappropriate use of force, for example, is actually a relatively rare behavior despite the publicity it receives.

Police departments have increasingly turned to the use of standardized tests, with the MMPI a very popular choice, even though Butcher (1979), in a review of the MMPI literature, reported that the relationship of MMPI scores to job performance was modest at best. In addition, it should be kept in mind that the MMPI was not designed to select police officers. Some authors (e.g., Saxe & Reiser, 1976) have argued that most police applicants score in the normal range on the MMPI and hence the predictive value of the MMPI is limited. Other authors (e.g., Saccuzzo, Higgins, & Lewandowski, 1974), while agreeing that scores are often normal, also argue that the pattern and elevation of the MMPI profile are distinguishable against various criteria, and that therefore the MMPI can still be used.

The Inwald Personality Inventory (IPI)

The Inwald Personality Inventory (Inwald, Knatz, & Shusman, 1983), a 310-item true-false inventory yielding 26 scales, was developed specifically for use by law-enforcement agencies in selecting new officers. The IPI attempts to assess the psychological and emotional fitness of recruits, with items that are relevant to police work. In addition to the 26 basic scales, the IPI yields a "risk" score, which is a weighted combination of several IPI scales. Studies of the IPI have yielded favorable results (e.g., Scogin, Schumacher, Gardner, et al., 1995). The manual reports that in one study the IPI accurately predicted 72% of the male officers and 83% of the female officers who were eventually terminated, compared with 62% and 66% for the MMPI (for critical reviews, see Bolton, 1995; Lanyon, 1995).

Racial differences. Baehr, Saunders, Froemel, et al. (1971) reported on a study to design and validate a psychological test battery to select police patrolmen. The battery consisted of 14 tests that covered four broad domains: motivation, mental ability, aptitude, and behavior. For motivation, measures of background and experience were used. Mental ability was assessed with standard tests of reasoning, language facility, and perception. Aptitude covered "creative potential" and social insight. And behavior was assessed through various aspects of temperament and personality. The test battery was administered to 540 patrolmen who represented both the upper-rated third and the lower-rated third in a large urban police department. The sample reflected the racial composition of the department: 75% white and 25% black.

The authors attempted to answer four basic questions: (1) Are there any racial differences on the predictor (i.e., test) variables? Significant differences in group means were obtained in three of the four areas, with white subjects scoring higher on motivation, mental ability, and behavior – only in the aptitude area was there no difference. (2) Are there any racial differences on the criterion (i.e., job performance) variables? The criteria used here were semiannual performance ratings by supervisors, as well as variables such as number of arrests made and disciplinary actions taken against the patrolman. Only four criterion variables showed consistent results. White and black officers did not differ on tenure (i.e., amount of time on the police force) or absenteeism. Black patrolmen made significantly more arrests and had significantly more disciplinary actions taken against them, a result that might simply reflect the area of their patrol. (3) Do such differences lead to discrimination? Using complex statistical procedures, the authors found that the best predictive validity was obtained when the test data on a given racial group were used to predict criterion scores for members of that same group; the poorest predictive validity was obtained when predicting across racial groups. When the groups were treated separately statistically, better prediction
resulted for the black group. The authors argued that these results are good evidence for computing separate prediction equations and that since the same level of performance is associated with a lower test score for the black group as compared to the white, a greater number of black applicants can be hired. (4) Which tests and variables are most related to effectiveness as a police officer, regardless of race? The authors were able to describe the more successful patrolman as having early family responsibilities (i.e., married and established a home), as having stability in both childhood and occupational environment, and better-than-average health. The successful patrolman has at least average mental functioning and good visual-perceptual skills. In interpersonal situations he tends to be cooperative rather than withdrawn or aggressive. He has good impulse control, is personally self-reliant, and has a work orientation rather than a social orientation. In a word – stable.

EXAMPLES OF SPECIFIC TESTS

The Job Components Inventory (JCI)

In England, Banks, Jackson, Stafford, et al. (1983) developed the Job Components Inventory (JCI), designed to assist young people in their vocational preparation and training. The JCI contains five sections: (1) Tools and equipment; covers the use of 220 tools and pieces of equipment ranging from small hand tools such as chisels to large pieces of equipment such as forges. (2) Perceptual and physical requirements; 23 items that deal with such aspects as dexterity, physical strength, and reaction time. (3) Mathematical requirements; 127 items that cover mathematics up to algebra and trigonometry, with emphasis on practical applications such as work with drawings. (4) Communication requirements; 22 items that deal with the preparation of reports and letters, dealing with complaints, and other interpersonal communication aspects. (5) Decision making and responsibility; nine items that deal with decisions in the context of work.

It is not clear what response the examinee makes, but usually such inventories ask the candidate to indicate whether a particular task is performed, how frequently, and how important the task is, in the context of a specific occupation or task. The JCI takes about an hour or less to administer, is easily understood by individuals with limited educational skills, and provides wide coverage of job elements.

The JCI was administered to 100 individuals in eight different job groups falling under one of two occupational areas – engineering jobs, such as drilling machine operator, and clerical jobs, such as mailroom clerk. For six of the eight job groups, it was possible to have supervisors complete the JCI as well. The overall pattern of results indicated a high level of agreement between the ratings given by the job holders and their supervisors, with most correlation coefficients above .50. Correlations between composite ratings for the entire JCI were higher in general than the correlations for the individual sections of the JCI. A comparison of the engineering vs. clerical occupational areas indicated significant differences on both the total score and three of the five JCI sections. Clerical jobs required a wider variety of tools and equipment and used more mathematical and communication components. In addition, significant differences were obtained across job groups within the same occupational area.

The Supervisory Practices Test (SPT)

Bruce and Learner (1958) developed the SPT to predict success in supervisory, managerial, and executive work. They first prepared a 100-item experimental form of the test with items drawn from the literature as well as from individuals in managerial and supervisory positions. The items are essentially miniature vignettes followed by several choices; for example, "If I were too busy to teach a new procedure to a subordinate, I would: (a) ask the subordinate to learn it on his own; (b) have another person teach it to the subordinate; or (c) make time in my next day's schedule."

The form was administered to 285 subjects who included 51 executives, 71 managers, and 163 persons in a variety of nonsupervisory jobs. Of these 100 items, 64 were retained on the basis of two criteria: (1) each of the distractors was endorsed by at least 15% of the sample; and (2) the item discriminated between supervisors and nonsupervisors. A cross-validation with a second sample resulted in a final form of 50 items.
Item weights were determined by a rather complicated procedure. A new sample of 200 supervisors and 416 nonsupervisors was tested. The sample was then divided in two ways. The first was supervisors vs. nonsupervisors. The second was in accord with the majority response for each item; that is, for each subject the number of responses agreeing with the majority was calculated. Then the entire sample was dichotomized into the 50% choosing the greatest number of majority responses, and the 50% choosing the least number of majority responses. The majority view was thus combined with the supervisors' view to develop positive weights for "good" responses, and negative weights for "poor" responses.
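The mechanics of combining a majority criterion with a group-membership criterion can be sketched in a few lines of code. This is our simplified illustration, not the authors' actual algorithm; the +1/-1 weights and the endorsement-rate comparison are assumptions made only to show the general idea.

    # Simplified sketch of a dual-criterion scoring key: a response earns a
    # positive weight when it is both the majority choice and relatively more
    # popular among supervisors; it earns a negative weight when it fails both
    # tests. The +1/-1 weights are an assumption for illustration.
    from collections import Counter

    def build_key(responses, is_supervisor):
        """responses: per-subject lists of choices, one entry per item."""
        key = []
        for i in range(len(responses[0])):
            choices = [r[i] for r in responses]
            majority = Counter(choices).most_common(1)[0][0]
            sup = Counter(c for c, s in zip(choices, is_supervisor) if s)
            non = Counter(c for c, s in zip(choices, is_supervisor) if not s)
            weights = {}
            for c in set(choices):
                sup_rate = sup[c] / max(1, sum(sup.values()))
                non_rate = non[c] / max(1, sum(non.values()))
                if c == majority and sup_rate > non_rate:
                    weights[c] = +1   # "good" response
                elif c != majority and sup_rate < non_rate:
                    weights[c] = -1   # "poor" response
                else:
                    weights[c] = 0
            key.append(weights)
        return key

A subject's total score is then simply the sum of the item weights for the responses chosen.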
Reliability. For a sample of 112 nonsupervisors who were retested with an interval of seven months, the obtained r was .77. The authors argued that for this sample the range of test scores was restricted (although they varied from 11 to 126), and a more correct estimate of reliability was .86. Split-half reliability for a sample of 177 individuals was .82.
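A correction of this kind is ordinarily done with the standard formula for restriction of range (Thorndike's Case 2). The sketch below is illustrative; the SD ratio used is our assumption, chosen only to show how an observed r of .77 can rise to about .86 (the authors do not report the ratio they used).

    import math

    def correct_for_range_restriction(r_restricted, sd_ratio):
        """Thorndike Case 2: sd_ratio = SD(unrestricted) / SD(restricted)."""
        u = sd_ratio
        return (r_restricted * u) / math.sqrt(
            1 - r_restricted**2 + (r_restricted * u)**2)

    # A hypothetical SD ratio of about 1.4 reproduces the corrected estimate:
    print(round(correct_for_range_restriction(0.77, 1.4), 2))  # ~0.86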
for guessing.
Validity. The final form was administered to var- The individual subtests are listed below:
ious samples of supervisors vs. nonsupervisors 1. Name comparison. This subtest contains
across a wide range of industries, and in each two columns of 150 names, and the examinee is
case the mean scores between groups were signif- required to indicate whether each pair of names
icantly different. In one sample of foremen who are the same or different. The subtest has a time
were also ranked by two supervisors on the basis limit of 6 minutes.
of overall competence, the correlation between 2. Computation. Composed of 50 multiple-
SPT scores and supervisor ratings was .62. In a choice items that require arithmetic, such as addi-
sample of 174 individuals, the SPT and a test tion or multiplication of whole numbers. Each
of intelligence were administered. Correlations item contains 5 choices, the last being “none of
between the SPT and various portions of the test these.” The time limit is 6 minutes.
of intelligence ranged from a low of .18 to a high 3. Three dimensional space. Composed of 40
of .35 (median r of about .25), indicating a low items, each with a stimulus figure and four
and positive correlation between the SPT and options. The stimulus figure represents a flat
intelligence. Finally, in another sample, the SPT piece of material that could be folded or bent
was correlated with another supervisory practices into a three-dimensional figure – the four options
measure. The overall correlation was .56 indicat- present such figures, with only one correct. The
ing significant overlap between the two measures. time limit is 6 minutes.
4. Vocabulary. Contains 60 multiple-choice
items, each item with four choices. Correct
The General Aptitude Test Battery (GATB)
answer involves which two choices mean the same
In the years 1942 to 1945, the U.S. Employ- (or the opposite). The time limit is 6 minutes.
ment Service decided to develop a “general” apti- 5. Tool matching. Composed of 49 items, each
tude battery that could be used for screening containing a stimulus drawing and four choice
applicants for many occupations, with an empha- drawings, of simple shop tools, such as pliers and
sis on occupational counseling. An analysis of hammers. Different parts of the stimulus drawing
Table 14–2. Composition of the General Aptitude Test Battery

Factor symbol   Aptitude                Specific subtests
G               General intelligence    Three-dimensional space; vocabulary; arithmetic reasoning
V               Verbal aptitude         Vocabulary
N               Numerical aptitude      Computation; arithmetic reasoning
S               Spatial aptitude        Three-dimensional space
P               Form perception         Tool matching; form matching
Q               Clerical perception     Name comparison
K               Motor coordination      Mark making
F               Finger dexterity        Assemble and disassemble
M               Manual dexterity        Place; turn

5. Tool matching. Composed of 49 items, each containing a stimulus drawing and four choice drawings, of simple shop tools, such as pliers and hammers. Different parts of the stimulus drawing are black or white, and the examinee matches the identical drawing from the available choices. The time limit is 5 minutes.

6. Arithmetic reasoning. Twenty-five verbal mathematical problems (e.g., If I buy eight oranges at 25 cents each . . .), with possible answers given in multiple-choice form with 5 choices, the fifth being "none of these." The time limit is 7 minutes.

7. Form matching. Composed of two groups of variously shaped line drawings. In one group, each drawing is identified by a number, and in the other by a letter. The drawings are jumbled up, and the examinee needs to match each figure in the second group with the identical figure in the first group. There are 60 such drawings, and the time limit is 6 minutes.

8. Mark making. Consists of 130 drawn squares; for each square the examinee must make a mark of three lines (II). The time limit is one minute.

9. and 10. Place and turn. Two subtests use a rectangular pegboard, divided into two sections, each with 48 holes. The upper section contains 48 cylindrical pegs. For the place subtest, the pegs are removed from the upper part and inserted in the corresponding holes of the lower section, two at a time. This is done three times (3 trials), with a time limit of 15 seconds per trial. The score is the sum of pegs moved over the 3 trials. For the turn subtest, the 48 pegs, each of which is painted half red and half white, are turned over and replaced in the same hole, using one's preferred hand. Three trials are also given, 30 seconds for each trial. The score is the total number of pegs turned over, for the three trials.

11. and 12. Assemble and disassemble. Two subtests use a small rectangular finger dexterity board that contains two sets of 50 holes and a supply of small metal rivets and washers. In the assemble subtest, the examinee assembles rivets and washers and places each unit in the designated hole as rapidly as possible. The subtest has a 90-second time limit. In the disassemble subtest, rivets and washers are disassembled and placed in designated areas; the time limit is 60 seconds. Both tests are measures of finger dexterity.

Scoring. Protocols can be hand scored locally or scored by computer services. For each subtest, the raw score, defined as number of items correct, is calculated. Raw scores are then converted to scores that are referenced on the norming population. The specific conversion depends upon the form of the GATB used, the type of answer sheet used, and the purpose for which the subtest score will be used. For example, the arithmetic reasoning subtest is used both in the calculation of intelligence and in the calculation of numerical aptitude; the same raw score will be converted differently for each of these purposes.

Raw scores are thus converted into aptitude scores, with a mean of 100 and SD of 20. These aptitude scores can be further converted to composite scores because the nine basic dimensions form three composites: a cognitive composite (G + V + N), a perceptual composite (S + P + Q), and a psychomotor composite (K + F + M). These three composites are then given different relative weights to predict each of five job families; that is, these are regression equations developed empirically. Finally, percentile scores for the sum of the composites can be calculated in accord with separate race norms for blacks, Hispanics, and others.
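The chain of conversions just described can be made concrete with a small sketch. The linear rescaling and the norm values below are our assumptions for demonstration only; the actual GATB conversions come from published norming tables and are, as noted, form- and purpose-specific.

    # Sketch of the GATB score chain: raw -> aptitude score (mean 100, SD 20)
    # -> composite. The norm means/SDs are invented placeholders; the real
    # conversions come from the published norming tables.
    NORMS = {"G": (45.0, 9.0), "V": (30.0, 6.5), "N": (28.0, 7.0)}  # hypothetical

    def aptitude_score(raw, aptitude):
        mean, sd = NORMS[aptitude]
        return 100 + 20 * (raw - mean) / sd   # rescale to mean 100, SD 20

    def cognitive_composite(scores):
        return scores["G"] + scores["V"] + scores["N"]   # G + V + N

    scores = {a: aptitude_score(r, a) for a, r in {"G": 50, "V": 33, "N": 31}.items()}
    print(cognitive_composite(scores))

The empirically developed job-family weights would then be applied to the three composites, and the weighted sum referred to the appropriate percentile table.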
Reliability. Temporal stability of the cognitive aptitudes (G, V, and N) is quite acceptable, with correlation coefficients typically above .80, and seems to decline slightly as the test-retest interval increases. In general, the results compare quite well with those of similar measures in other test batteries.

For the perceptual aptitudes (S, P, and Q), test-retest coefficients are in the .70s and .80s, while for the psychomotor aptitudes (K, F, and M) the test-retest coefficients are in the high .60s and low .70s. In general, reliability is higher for samples of adults than for samples of ninth and tenth graders, probably reflecting the lack of maturation of the younger examinees.

Equivalent-form reliability in general parallels the test-retest reliability findings, with most coefficients in the .80s and .90s. Internal-consistency reliability is not appropriate to assess the reliability of speeded tests, and because the GATB is so heavily dependent on brief time limits, internal consistency is not an appropriate reliability estimate.

Validity. The GATB has been studied intensively, with more than 800 validity studies available in the literature. Much of the validity information on the GATB can be interpreted in light of construct validity – i.e., does each subtest measure the particular designated construct? The GATB manual (U.S. Department of Labor, 1970) presents considerable evidence comparing results on the GATB to a variety of other aptitude and vocational interest measures. Jaeger, Linn, and Tesh (1989) report that of the 51 convergent validity coefficients for the G composite, the correlation coefficients ranged from .45 to .89, with a median value of .75. Other representative results are a median value of .47 for form perception, and a median value of .50 for clerical perception. In general, the results strongly support the construct validity of the cognitive aptitudes; there is less support for the construct validity of the perceptual aptitudes, and no conclusion can be made on the psychomotor aptitudes because, surprisingly, there is little available data.

In terms of criterion validity, the overall conclusion is that the GATB has small but relatively consistent positive correlations with ratings of job performance and other relevant criteria.

Criticisms. One of the concerns is the sheer amount of score conversion that takes place, to the point where even experts are not able to reconstruct the nature and logic of the various statistical steps (Hartigan & Wigdor, 1989). For other criticisms of this battery, see Kirnan and Geisinger (1990).

How Supervise?

The How Supervise? test is a 70-item inventory designed to assess a respondent's knowledge of supervisory practices. The test was developed as a doctoral dissertation, became quite popular in the 1950 to 1975 period, but then began to languish, primarily because it was not revised. This is a clear example of a successful product whose marketing efforts did not support its usefulness.

As with most other tests, the How Supervise? began as a pool of 204 items generated to represent good supervisory practices. These items were then judged by groups of experts who identified ambiguous items, as well as "correct" responses. A scoring key was constructed to represent the modal correct response.

The pool of items was then administered to a sample of 577 supervisors, who had been rated as to effectiveness by their superiors. A contrasted-groups analysis of the top and bottom 27% of the sample indicated that 23 items statistically discriminated the two subsamples. However, 140 items were retained, and two forms of the test, each with 70 items distributed over three sections, were developed. All responses are indicated by "desirable," "undesirable," or "?" for the first two sections, and "agree," "disagree," or "?" for the third section.

The first section is called Supervisory Practices and requires the respondent to indicate whether specific supervisory practices are or are not desirable. The second section, Company Policies, requires the respondent to indicate the desirability of various policies, such as concerning labor unions or promotion policies. The third section, called Supervisor Opinions, requires the respondent to indicate agreement or disagreement with various supervisory opinions,
in particular whether workers are motivated by external punishments and rewards. Although the three subsections are scored separately, the scores are summed into a single total.

The 1971 manual provides norms for 12 samples of supervisors in a variety of industries, with median scores showing rather wide variability from industry to industry.

Note that the How Supervise? measures knowledge and not necessarily behavior. Knowing what is a good supervisory practice does not necessarily mean that the person will behave in a consonant manner. Furthermore, a number of investigators have argued that the How Supervise? is basically an intelligence test (Millard, 1952).

Reliability. Reliability seems adequate. Both alternate-form and split-half reliabilities are typically in the .80s range.

Validity. Three types of validity studies are available (Dobbins, 1990). One type involves the comparison of scores obtained by individuals who occupy different levels of supervision, such as higher-level vs. lower-level supervisors. In general, the results are that higher-level supervisors score higher than lower-level supervisors, although confounding aspects such as age and intelligence are typically not accounted for. A second type of study looks at concurrent validity with on-the-job ratings. Typical correlation coefficients are low but significant, in the .20s and .30s range, although a number of studies also report no significant correlations. A third group of studies focuses on the convergent and discriminant validity of the How Supervise?, showing that scores on this instrument are indeed significantly correlated with measures of intelligence. A typical study is that by Weitz and Nuckols (1953), who used a modified version of the How Supervise? with district managers in a life insurance company. Scores were compared with various criteria that reflected volume of sales, as well as personnel turnover. None of the correlation coefficients were significant, except for a correlation of .41 between number of right answers on section 1 and educational level.

Work Orientation scale of the CPI (WO). One of the major themes in the study of managerial performance is that there are two motivational aspects related to high-quality managerial success. These aspects are given different names by different researchers, but one refers to a "need for advancement," for seeking more demanding levels of responsibility, and the other refers to the degree to which a person espouses high personal standards of performance.

Gough (1984; 1985) attempted to develop two scales on the CPI (see Chapter 4) that parallel these motives; he called them the Managerial Potential and the Work Orientation scales. Two samples of subjects were used, married couples and correctional officers. On the basis of several statistical criteria (see Gough, 1985, for details), a 40-item Work Orientation (WO) scale was developed. Alpha coefficients for two other samples were .75, and test-retest correlations over a 1-year period were .62 and .70. A comparison of English and French versions of the CPI given to the same sample of bilingual high-school students yielded an interform correlation of .76 for males and .78 for females.

An analysis of WO scale scores with a variety of other instruments indicated that high-scoring persons on the WO are well-organized, optimistic, dependable individuals, who do not necessarily have exceptional intellectual abilities. The major goal of the WO scale is to identify those individuals who possess self-discipline, dependability, perseverance, efficiency – essentially those qualities that are incorporated under the rubric of the "Protestant ethic." Indeed, in a study of fraternity and sorority members, higher-scoring individuals on the WO scale were characterized as responsible, reliable, reasonable, moderate, dependable, clear thinking, optimistic, stable, efficient, and mature.

The Wonderlic Personnel Test (WPT)

The WPT was first published in 1938 and was designed to test adult job applicants in business and industrial settings. New forms have been developed and extensive norms have been collected since then, but the nature of the test has not changed. The WPT is a 50-item, 12-minute test of general mental ability, named "personnel" so as to reduce the candidate's recognition that this is a test of intelligence and the fear that may
be associated with taking a mental ability test. The WPT was standardized on samples of adults ranging in age from 20 to 65, with a total sample of more than 50,000 individuals. A recent manual indicated that there are 16 alternate forms available, including a large-print version for the visually handicapped; it provided extensive normative data by age, race, gender, industry, type of job, and educational level. The items are of various types such as yes-no, multiple choice, and requiring a numeric answer, and cover a broad range of cognitive abilities that include analogies, definitions, geometric figures, and so on. The items are intermingled and presented in order of difficulty, with an average difficulty level of about 60%.

Administration. The test manual includes directions for administration with both a 12-minute time limit and with unlimited time, although the usual procedure and the norms are based on the timed administration.

Scoring. Handscoring of the test is very simple with the use of a scoring key – the score is simply the number of items answered correctly. Although the manual recommends that raw scores be used, these can be converted to the more familiar IQs.

Reliability. Test-retest reliability coefficients are in the .82 to .94 range, and interform reliabilities go from a low of .73 to a high of .95. Even internal consistency coefficients are high, primarily in the mid .80s to mid .90s range.

Validity. The literature does indicate that reliable measures of general mental ability are valid as predictors of job performance in a wide variety of occupational areas, and the WPT has been validated in a large number of studies (Schmidt, 1985). Correlations with educational level and academic achievement are quite substantial, with coefficients ranging from .30 to .80. An example is the study by Dodrill and Warner (1988), who administered the WAIS and the WPT to four groups of subjects: hospitalized psychiatric patients, hospitalized nonpsychiatric seizure-disorder patients, psychiatric epileptic patients, and normal control subjects. Correlations between test scores on the two tests ranged from .85 to .91, and between 81% and 94% of the subjects in each group had IQs from the two tests within 10 points of each other. These results do support the value of the WPT as a measure of general intelligence; in many ways, it is a useful screening instrument.

Norms. The test manual recommends that separate norms by race and ethnic group be used. Thus, what is recommended is that candidates' test scores be converted to percentiles separately for each ethnic group based on that group's norms. Selection or hiring of candidates is then made within each group.

Although cutoff scores need to be established by a particular industry, based in part upon the number of available candidates, the test manual does suggest minimum scores for various occupations as general guidelines. These range from 30 for statistician and engineer, to 25 for foreman and private secretary, to 8 for unskilled laborer and janitor.

Because older individuals tend to do less well on the WPT than younger persons, the manual recommends adding a certain number of points to the raw score, to reflect that person's age. For example, a person aged 40 to 44 would receive 4 extra points, whereas a person aged 60 to 69 would receive 11 points.
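These two adjustments, the age credit and the within-group percentile conversion, are mechanical enough to sketch in code. Only the two age brackets quoted above come from the text; any other brackets, and the norm values used below, are hypothetical placeholders.

    from bisect import bisect_left

    # Age credits: only the 40-44 (+4) and 60-69 (+11) values are reported in
    # the text; other brackets would have to come from the manual itself.
    AGE_CREDIT = {range(40, 45): 4, range(60, 70): 11}

    def adjusted_score(raw, age):
        credit = next((pts for ages, pts in AGE_CREDIT.items() if age in ages), 0)
        return raw + credit

    def within_group_percentile(score, group_norms):
        """group_norms: sorted scores from that group's own norm sample."""
        return 100 * bisect_left(group_norms, score) / len(group_norms)

    # Hypothetical usage: a 42-year-old with a raw score of 24, referred to
    # invented norms for his or her own group (24 + 4 = 28 -> 90th percentile).
    norms = sorted([8, 12, 15, 18, 20, 21, 23, 25, 27, 30])
    print(within_group_percentile(adjusted_score(24, 42), norms))  # 90.0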

Legal issues. We discussed some legal issues in Chapter 5, but mention might be made here of the case of Griggs v. Duke Power (1971). In 1971 the U.S. Supreme Court ruled that the tests used by Duke Power, a utility company, were inappropriate, in that these tests failed to measure the applicant for the job and adversely impacted protected groups; one of the tests was the Wonderlic. However, another company, PPG Industries, also used the WPT and in a 1983 court case was able to establish the validity of the selection battery that included the WPT (Schoenfeldt, 1985).
INTEGRITY TESTS

These tests, also known as honesty tests, are paper-and-pencil inventories used for personnel selection to identify potentially dishonest or "counterproductive" employees. Usually these tests contain items that assess the applicant's attitude toward theft and inquire about any past thefts. Such tests are typically proprietary tests purchased from a publisher who provides such services as scoring and guidelines for score interpretation. The development of most such tests has taken place largely outside mainstream psychological testing, so that these tests are not available for professional perusal, nor are their psychometric properties evaluated in the forum of the scientific literature. Relatively little psychometric information is available about integrity tests, and many of them seem to have been developed by individuals who have a minimum of psychometric training. What is available often is contained in in-house reports rather than public scientific journals.

American businesses lose anywhere from $15 to $25 billion per year because of employee theft, and 30% of all business failures are attributed to employee theft (Camara & Schneider, 1994). Despite such impressive figures, the base rate for a behavior like employee theft is very low in most settings – usually less than 5% (Hollinger & Clark, 1983). As discussed in Chapter 3, when base rates are low, accurate detection by tests is minimal (Dawes, 1962). Integrity tests then fulfill a major need and represent a major industry; in 1984 Sackett and Harris estimated that more than 5,000 firms were using such tests. These tests are used particularly in settings where employees may have access to cash or merchandise, such as banks and retail stores. They are used for both entry-level personnel, such as clerks and cashiers, and for higher-level positions, such as supervisory and managerial ones (for overviews of this field see D. Arthur, 1994, and J. W. Jones, 1991).
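The base-rate problem can be made concrete with a few lines of arithmetic. The sensitivity and specificity figures below are hypothetical, chosen only to show how a seemingly accurate test breaks down when the base rate is 5%.

    # With a 5% base rate, even a test that correctly flags 90% of thieves
    # (sensitivity) and correctly clears 90% of honest employees (specificity)
    # is wrong about two-thirds of the time when it flags someone.
    base_rate, sensitivity, specificity = 0.05, 0.90, 0.90  # hypothetical values

    true_flags = base_rate * sensitivity                  # 0.045
    false_flags = (1 - base_rate) * (1 - specificity)     # 0.095
    ppv = true_flags / (true_flags + false_flags)
    print(round(ppv, 2))  # 0.32 -- only about 1 in 3 flagged employees is a thief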
Camara and Schneider (1994) were able to identify 46 publishers or developers of integrity tests and requested their cooperation in completing a survey; 65% (n = 30) responded. Camara and Schneider (1994) reported that the majority of these tests are computer scored by the test publisher or are scored on-site using software provided by the publisher, and typically both a total score and subtest scores can be obtained. Scores are often reported in standard or percentile form, indicating the risk level for theft by the individual applicant. Cutoff scores are provided by 60% of the tests studied. Computer-generated narrative reports are available for most tests, although most of these reports cover only basic results. Actual scoring keys are typically closely guarded by the publisher.

Types of integrity tests. Two distinct types of tests fall under the rubric of integrity tests. The first are overt integrity tests that inquire directly about attitudes toward theft and about prior dishonest acts. The second are broader-oriented measures that are less transparent and perceive theft as one aspect of a broader syndrome of deviant behavior in the workplace (Sackett, Burris, & Callahan, 1989). These tests may include scales designed to measure drug use, job satisfaction, the prediction of burnout, and related variables.

Overt integrity tests usually have two sections – one that deals with attitudes toward theft and other forms of dishonesty and one dealing with admissions of theft and other illegal activities such as drug use. Perhaps the three most common of these tests are the London House Personnel Selection Inventory, the Reid Report, and the Stanton Survey.

The broader personality-oriented measures are designed not as measures of honesty per se, but as predictors of a wide variety of counterproductive work behaviors. Some of these measures assess such personality aspects as nonconformance, irresponsibility, lack of conscientiousness, degree of socialization, and so on.

Cutoff scores. Although most test manuals indicate that scores on the integrity test should not be used as the only criterion for not hiring an individual, most also provide cutoff scores that separate the score distribution into two areas, such as pass and fail, or three areas, such as high, medium, and low risk.

Reliability. Reliability of most of these tests seems to be consistently high, with reported test-retest coefficients typically above .70, and internal consistency coefficients often in the .90s range.

Validity. Sackett and Harris (1984) reviewed the validity data for 10 integrity tests, with a focus on criterion validity. Five categories of studies were available: (1) comparisons with polygraph (lie-detecting machine) results; (2) measures of future behavior, such as being discharged for theft; (3) admission of past theft, where tests are
administered under conditions of anonymity; (4) shrinkage reduction – i.e., a store is monitored for amount of loss, then the integrity testing program is introduced, and subsequent rates of loss are compared to previous losses; (5) contrasted groups – for example, convicts vs. a control group are tested.

What are the results? (1) Polygraph studies. When both attitude and admission portions are used, correlation coefficients between test scores and polygraph results range from .72 to .86, with an average of .78. When only the attitude section is used, the range of coefficients goes from .27 to .76, with an average of .49. Thus integrity-test scores correlate significantly with polygraph results and even more so when admissions are incorporated into the procedure. (2) Future behavior. In general, the correlations between test scores and future behavior are rather low, often in the .20s, but there are a number of methodological problems including the fact that the base rate (the number of employees dismissed for stealing) is rather low. (3) Admissions. Integrity test scores consistently correlate with admissions of past theft, whether such data is collected as part of a preemployment procedure, or anonymously by current employees. (4) Shrinkage reduction. The small amount of evidence available indicates that the introduction of integrity tests into a company's procedures lowers the amount of employee shoplifting. Whether in fact such changes are due to unknown other variables, to concurrent changes in company policy, or to perceived change in a company's tolerance of dishonest behavior, is not clear. In fact, one of the common confounding aspects is that the introduction of integrity testing in a company may be perceived by the employees as a change in the company's tolerance of theft. (5) Contrasted groups. Studies of convicts vs. the general public indicate substantial mean-score differences on typical integrity tests. Often the assumption is that convicts "fake good" on these tests to increase their chances of parole, and therefore the tests are somewhat resistant to faking in that they still show significant group differences.

It should be pointed out that the above results were based on only 14 studies, many of which were severely flawed from an experimental point of view. Subsequently, Sackett, Burris, and Callahan (1989) were able to locate 24 studies. One of the findings is that little theft is detected. Thus, although the score differences between those individuals who are detected for theft and those who are not are substantial, the correlation coefficient is very small due to the lack of variance on the criterion, i.e., few employees are detected for theft.

An APA Task Force (Goldberg, Grenier, Guion, et al., 1991) reviewed some 300 studies and concluded that for those few tests for which validity information was available, their predictive validity was supported. In part, this may have been due to the fact that validity criteria for integrity tests were expanded to cover such aspects as absenteeism, personnel turnover, behavioral indicators such as grievances, and supervisory ratings. However, validity information was available for only a few of the tests, and much of what was available was judged to be fragmented and incomplete. It was suggested that integrity tests predict more validly at the untrustworthy end of the scale than at the trustworthy end.

Ones, Viswesvaran, and Schmidt (1993) also conducted a meta-analysis of the validity of integrity tests. They concluded that the best estimate of the mean true validity was .41, when the criterion is supervisory ratings of overall job performance. They also concluded that the validity for predicting counterproductive behavior on the job (such as theft and absenteeism) is also fairly substantial, but that validity is affected by several moderator variables such as whether the sample is composed of applicants or current employees.

Most integrity tests do not correlate significantly with measures of intelligence, with available coefficients typically in the –.10 to +.10 range. When scores on the attitude and the admissions subsections are correlated with each other, significant coefficients are obtained, often in the .50 to .70 range.
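The earlier point, that little theft is detected and so the criterion has almost no variance, can be illustrated with the point-biserial correlation. The effect size in the sketch is hypothetical; it shows how even a large score difference between detected and nondetected employees yields a tiny validity coefficient when only 2% of employees are ever detected.

    import math

    # Approximate point-biserial conversion: r ~ d * sqrt(p * (1 - p)), where
    # d is the standardized score difference between the two criterion groups.
    # Suppose detected thieves score a full SD higher (d = 1.0, hypothetical).
    d = 1.0
    for p in (0.50, 0.10, 0.02):      # proportion of employees detected
        r = d * math.sqrt(p * (1 - p))
        print(f"detection rate {p:.0%}: r = {r:.2f}")
    # 50% -> .50, 10% -> .30, 2% -> .14: the same group difference, but the
    # correlation shrinks as the criterion variance disappears.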
Construct validity. This is a much more complex type of validity to establish, in part because the construct is ill defined, and different integrity tests reflect different definitional points of view. In general, however, the available evidence seems congruent with construct validity. People known to be dishonest, such as convicted thieves, do less well on these tests than the general population does. Males do worse than females, and younger people do worse than older people (Goldberg, Grenier, Guion, et al., 1991).

The employer's viewpoint. The APA Task Force indicated that there were three concerns of interest to employers. The first is the extent to which the use of integrity tests leads to a generally improved workforce, e.g., having greater job satisfaction. Some studies do indeed present such evidence. A second concern might be the usefulness of integrity tests in predicting the occurrence of such events as theft. The conclusion here is that few predictive studies are free from methodological difficulties and that most integrity tests have not been used in predictive studies. For a few tests, however, the available evidence would support their predictive validity. A third concern is the extent to which losses due to employee theft are actually decreased. Although the picture is far from complete, the available evidence suggests that for some integrity tests there is documentation available to support their utility in this respect.

Faking. A major concern with integrity tests is their resistance to faking. Three lines of inquiry are relevant here (Sackett, Burris, & Callahan, 1989): (1) the effects of direct instructions to fake good; (2) the correlation of test scores with various indices of social desirability and "lie" scales (see Chapter 16); and (3) studies that statistically separate the effects of lying from the obtained validity coefficients. Few studies exist in each category, and at least for category (1) the results are contradictory – although in both studies the instructions to fake did result in significantly different mean scores. For the second category, most available studies do show significant correlations between integrity test scores and various measures designed to assess faking. Under category (3) the very limited evidence suggests that integrity tests do measure something beyond the mere ability to fake.

Factor analysis. Studies of factor analyses of integrity tests suggest that the instruments are not unidimensional, but little is known about the utility and validity of the individual factor scores. On the Reid Report, for example, the evidence suggests that high scorers are emotionally stable, relatively optimistic about human nature, flexible in their beliefs and attitudes, and motivated to excel because of the intrinsic satisfaction that work provides (Cunningham, Wong, & Barbee, 1994).

Adverse impact. A few studies have looked at adverse impact. The available evidence indicates that there are no significant gender and/or race differences on such tests; when differences are obtained, they typically favor females and blacks.

Alternatives. Alternatives to integrity tests have innumerable drawbacks (Goldberg, Grenier, Guion, et al., 1991). Unstructured interviews are time consuming and have lower reliability and validity. Structured interviews do use many items identical to those found in integrity tests, but have limited validity and utility as a stand-alone screening instrument. Background checks are expensive and elicit legal concerns. Surveillance can be quite expensive and may be counterproductive to morale and productivity. The use of polygraphs in employment screening is prohibited by the Employee Polygraph Protection Act of 1988. However, there are exceptions such as for governmental employees, agencies that are involved in national security and defense, and other agencies. Incidentally, the passage of this act resulted in an increased use of integrity tests. Even if polygraph examinations were not illegal, they are of questionable validity.

Legal aspects. Many psychological tests have been curtailed in their use by court actions and legal issues (see Chapter 16). Integrity tests may be the only category of tests whose use is actually enhanced by legal concerns. As the APA Task Force pointed out (Goldberg, Grenier, Guion, et al., 1991), under the "negligent hiring" legal doctrine, an employer may be held liable if an "unfit" employee is hired who later causes harm to coworkers or others, if the employer could have either reasonably foreseen the risk of hiring such a person or had failed to conduct a reasonable inquiry into the employee's fitness. Integrity tests may serve as protection against such charges.
A typical study. Collins and Schmidt (1993) looked at white-collar crime – i.e., nonviolent crimes for financial gain that utilize deception and are usually carried out by an individual rather than an organization. They wondered whether integrity and personality measures could discriminate between white-collar criminals and their nonoffender colleagues.

They studied 329 federal prison inmates who had been convicted of white-collar crimes, who volunteered for the study, and 320 individuals employed in white-collar positions of authority. All subjects were administered the CPI, a biodata questionnaire, and an integrity scale; all three instruments yielded a total of 37 scores for each subject.

For the statistical analyses, the total sample was randomly divided into a validation sample and a cross-validation sample. The statistical analyses were conducted in three steps. First, the number of variables was reduced to 16 in the validation sample. Second, a discriminant function was developed in the validation sample. Third, the discriminant function was cross-validated. Among the top four variables that discriminated between criminals and noncriminals was the Performance subscale of the integrity test (the Performance scale is said to measure conscientious work attitudes and behavior). The other three variables were scales from the CPI: Socialization, Responsibility, and Tolerance. As discussed in Chapter 4, the Socialization scale measures the degree to which individuals adhere to social norms; the Responsibility scale measures the degree to which an individual is conscientious, responsible, and attentive to duty; the Tolerance scale assesses the degree to which a person is tolerant and trusting. The authors suggested that the common theme underlying these four scales is that of "social conscientiousness."
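The three-step strategy – reduce the variables, fit a discriminant function, then check it on a holdout sample – is a standard guard against capitalizing on chance, and can be sketched as follows. The data here are simulated and only the sample and variable counts are taken from the study's description; this is an illustration of the general procedure, not the authors' actual analysis.

    # Sketch of a validation / cross-validation design like Collins and
    # Schmidt's: fit a linear discriminant function on one half of the sample
    # and evaluate it on the other. All data are simulated for illustration.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(649, 37))            # 649 subjects, 37 scores
    y = np.r_[np.ones(329), np.zeros(320)]    # 329 offenders, 320 nonoffenders
    X[y == 1, :4] -= 0.8                      # build in a group difference

    X_val, X_cross, y_val, y_cross = train_test_split(
        X, y, test_size=0.5, random_state=0)

    selector = SelectKBest(f_classif, k=16).fit(X_val, y_val)   # step 1: 37 -> 16
    lda = LinearDiscriminantAnalysis().fit(
        selector.transform(X_val), y_val)                       # step 2: fit
    print(lda.score(selector.transform(X_cross), y_cross))      # step 3: holdout

Because both the variable reduction and the discriminant weights are estimated on the validation half only, the holdout accuracy is an honest estimate of how well the function would classify new cases.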
SUMMARY

In this chapter, we have looked at a variety of issues, findings, and specific tests as they apply to testing in the world of work. The overall picture is one of positive findings, with substantial cautions. There are many tests and applications that are potentially useful, yet the test practitioner must be careful and mindful, not only of the basic issues of reliability and validity, but also legal and public relations issues.

SUGGESTED READINGS

Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494.

An excellent theoretical discussion of the concept of validity as it pertains to personnel selection and decisions.

Cascio, W. F. (1995). Whither industrial and organizational psychology in a changing world of work? American Psychologist, 50, 928–939.

Although not directly related to psychological testing, this is an excellent overview of the current and future state of the world of work. The author discusses key changes in such a world and some of the research questions that must be answered. A careful reading of this article will indicate many areas where the judicious application of psychological testing could be most useful.

Cunningham, M. R., Wong, D. T., & Barbee, A. P. (1994). Self-presentation dynamics on overt integrity tests: Experimental studies of the Reid Report. Journal of Applied Psychology, 79, 643–658.

The Reid Report is one of the most popular integrity tests. The authors conducted a series of three experiments to study the effects of instructions and monetary reward on faking of the Reid Report.

Day, D. V., & Silverman, S. B. (1989). Personality and job performance: Evidence of incremental validity. Personnel Psychology, 42, 25–36.

An investigation of the relationship between specific personality variables, as measured by the Jackson Personality Research Form, and job performance as rated by seven managers, for a sample of accountants. Three personality scales were significantly related to important aspects of job performance.

Haire, M. (1950). Projective techniques in marketing research. Journal of Marketing, 14, 649–656.

This is somewhat of a classic study in which two samples of subjects were asked to characterize a woman, based only on her grocery shopping list. The shopping list presented to the two samples was identical except for one item – in one sample the item referred to instant coffee, in the other to regular coffee. The results are intriguing.
DISCUSSION QUESTIONS

1. Have you ever been given a test as part of a job? What was the experience like?
2. Consider the following two occupations: nurse in an operating room vs. teacher of fifth-grade children. How might job success be defined in each of these?
3. How might you answer someone who states that, "psychological tests are useless in the world of work"?
4. Do you think that applicants for police work should be psychologically tested?
5. If you had to indicate the three major themes of this chapter, what might they be?
15 Clinical and Forensic Settings

AIM This chapter looks at testing in two broad settings: testing in clinics or mental health centers, and testing in forensic settings, settings that are "legal" in nature such as jails, courtrooms, etc. Obviously, the two categories are not mutually exclusive. A clinical psychologist may, for example, evaluate a client, with the evaluation mandated by the courts. Under clinical settings, we look at neuropsychological testing, projective techniques, some illustrative clinical issues and clinical syndromes, as well as applications in the area of health psychology. Under forensic settings, we look at some illustrative applications of testing, as well as how legislation has affected testing.

CLINICAL PSYCHOLOGY: NEUROPSYCHOLOGICAL TESTING

This field formally began toward the end of the Second World War in the early 1940s when clinical psychologists were asked to test brain-injured soldiers to determine whether their behavior and difficulties reflected an "organic brain syndrome." At first, available tests such as the Rorschach and the Wechsler were used for this purpose because they were already used for the routine assessment of psychiatric patients. Subsequently, Halstead and his student Reitan developed a battery of tests designed specifically to assess the presence or absence of brain dysfunction. The Halstead-Reitan Neuropsychological Test Battery had a profound impact and expanded the focus from whether there was a brain lesion present to the determination of the nature, location, and behavioral consequences of such lesions (Prigatano & Redner, 1993). At present, there is only one other major neuropsychological test battery, called the Luria-Nebraska (Golden, Purisch, & Hammeke, 1985; see Franzen, 1987, for a review).

The Halstead-Reitan Neuropsychological Battery

In 1935, at the University of Chicago, Halstead established one of the first neuropsychology laboratories to study the impact of impairment in brain functions on behavior. He developed a number of tests to be used in this endeavor. One of his students, Reitan, subsequently modified and expanded the battery in his own laboratory. The result is three separate batteries, one for young children (aged 5 to 8), one for older children (aged 9 to 14), and one for adults. Each battery includes a minimum of 14 subtests, as well as the age-appropriate Wechsler test of intelligence, a broad-range academic achievement measure, and for adults, the MMPI.

The test batteries have undergone extensive research and revisions throughout the years. The adult battery contains five of the original Halstead tests. Two examples are:

1. The Category Test consists of 208 photographic slides of geometric figures that are projected one at a time on a screen. The slides are divided into subsets, and the items in a subset

follow a particular "rule" or category (e.g., all figures have a missing portion in the upper left quadrant). The patient tries to figure out the rule by pressing one of four levers associated with one of four choices. If the response is correct, a bell sounds; if the response is incorrect, a buzzer sounds. Patients are told when a new subset of slides is to begin.

Basically, this is a learning task that requires a high level of new problem-solving skills. Individuals impaired on this task because of brain lesions or other aspects often show deficits in abstract learning, judgment, and concept formation, and lack the ability to concentrate over a period of time.

Scoring is simply the number of errors, with a score of 50 to 51 discriminating well between neurologically impaired and neurologically intact individuals.

2. The Tactual Performance Test uses a wooden board with spaces for 10 geometrically shaped blocks, such as a square, a triangle, and a cross. The patient is blindfolded and does not see the board or the blocks. He or she is guided to use the dominant hand to locate the board and the blocks and is told to place the blocks in their proper spaces as rapidly as possible. A second trial is done with the nondominant hand and a third trial with both hands. The board and blocks are then removed, the patient's blindfold is removed, and the patient is asked to draw a picture of the board on a sheet of paper.

Both the administration and scoring of this subtest are complicated. For example, the patient must be correctly blindfolded so that "peeking" cannot take place. Scoring is the time taken for completion of each trial, but there is a time limit of 15 minutes; however, extra time may be given if the patient was near completion. Scoring of the memory drawing requires a fair amount of subjectivity. The subtest is a measure of problem solving, specifically right-left differences in tactile, kinesthetic, and motor abilities in the absence of visual cues. It also measures incidental memory, the ability to remember things even when not explicitly told to do so.

Other subtests in the battery include a finger tapping test (tapping a telegraph-type key) and a trail making test, which resembles a connect-the-dots procedure. The results are recorded on protocol sheets, with a summary sheet available so that all test results can be studied with a minimum of paper shuffling. The entire battery takes some 5 hours to administer, more or less depending on the patient's age, degree of brain damage, and other aspects.

The aim of the Halstead-Reitan is not simply to diagnose whether there is brain injury, but to determine the severity of such injury, the specific localization of the injury, the degree to which right or left hemisphere functioning is affected, and to provide some estimate of the effectiveness of rehabilitation.

The battery requires some rather expensive and bulky materials, so unlike the MMPI and WAIS, it is not a portable test. Such testing is usually carried out in a clinic. Numerous modifications of the battery are available; for example, in the Category Test individual stimulus cards rather than slides can be used.

Most of the subtests used in the Halstead-Reitan are actually "borrowed" from other procedures and tests. On the one hand, this means that available psychometric information as to reliability and validity should generalize to this "new" use. On the other hand, the entire field of neuropsychological assessment has paid little attention to such psychometric issues (Davison, 1974).

Perhaps more than any other test we have considered, the utility of the Halstead-Reitan is closely related to the competence of the test administrator. This is not an easy test to administer, score, and interpret. Although most clinical psychologists now receive some training in neuropsychological assessment, administration and interpretation of the Halstead-Reitan require considerable knowledge of neuropsychology and supervised training.

Validity. The validation of the Halstead-Reitan has primarily focused on concurrent validity, and specifically on the ability of the battery to correctly identify brain-damaged patients from non-brain-damaged controls. As Whitworth (1987) states, the reliability and validity of the Halstead-Reitan are higher than for almost any other type of psychometric procedure (see Heveren, 1980, and Sherer, Parsons, Nixon, et al., 1991, for examples of validity studies). Whether such a broad and uncritical endorsement is warranted remains to be seen, but certainly, the reliability and validity of this battery in differentiating

between impaired and intact brain function is well established (Franzen, 1989; Heveren, 1980).

Criticisms. There are many criticisms of the Halstead-Reitan in the literature, some substantive and some less so. The apparatus is not mass produced and is seen by many as clumsy and very expensive. Test administration is long and can be stressful; the length of the test means, in part, that it is a costly procedure to administer and that it may not be appropriate for a patient whose medical condition places severe limits on their capacity to complete extensive testing. The scoring is characterized as simplistic and primarily aimed at assessing whether there is organicity or not, rather than using a more normative approach. Individual practitioners often modify the battery or use only part of it, so that standardization is not followed. Finally, the problem of false negatives can be a significant one (Whitworth, 1987).

Some criticisms may not be fully deserved. For example, there are normative scores based on age, education, and gender that allow the test results to be converted to T scores for comparative purposes (R. K. Heaton, Grant, & Matthews, 1991).
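To make this normative approach concrete, the conversion involved is the standard T-score transformation (mean of 50, standard deviation of 10). The short Python sketch below is purely illustrative; the function name and the norm values are invented for the example and are not taken from the Heaton, Grant, and Matthews (1991) tables.

# Illustrative sketch: converting a raw Halstead-Reitan subtest score to a
# demographically referenced T score. The norm mean and SD would come from
# the published table matching the patient's age, education, and gender;
# the values used here are hypothetical.
def to_t_score(raw, norm_mean, norm_sd):
    z = (raw - norm_mean) / norm_sd
    return 50 + 10 * z

print(round(to_t_score(raw=52, norm_mean=45.0, norm_sd=8.0), 1))  # 58.8

A T score near 50 indicates performance typical of the reference group; the hypothetical patient above scores somewhat above that group's mean.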
Other criticisms may not really be criticisms. For example, the Halstead-Reitan does not reflect a particular theory of brain functioning; the subtests were chosen empirically because they seemed to work. Some see this as a weakness, others as a strength. A theoretical framework is now available (e.g., Reitan & Wolfson, 1985).

Neuropsychological tests in general have a number of limitations. Prigatano and Redner (1993) identify four major ones: (1) not all changes associated with brain injury are reflected in changed test performance; (2) test findings do not automatically indicate the reason for the specific performance; (3) neuropsychological test batteries are long to administer and therefore expensive; and (4) a patient's performance is influenced not just by brain dysfunction but also by a variety of other variables such as age and education.

PROJECTIVE TECHNIQUES

For many years, the primary testing tools of clinical psychologists were projective techniques such as the Rorschach Inkblot Technique. These techniques have in common the presentation of ambiguous and malleable stimuli to which a large number of different responses can be made. Presumably, the specific responses given by a client reflect something about that individual's psychodynamic functioning. Projective techniques no longer occupy the dominant position they did years ago, but nevertheless continue to be used in clinical practice and research.

Most projective techniques fall into one of five categories (Lindzey, 1959):

1. Associative techniques; the subject responds to a particular stimulus, such as an inkblot or a word, by indicating what the stimulus suggests. The Rorschach Inkblot Technique is a prime example.
2. Construction techniques; the subject constructs a response, usually in the form of a story, to a stimulus, usually a picture. The prime example here is the Thematic Apperception Test (TAT).
3. Ordering techniques; these involve placing a set of stimuli in a particular order. Typically the stimuli are a set of pictures, very much like the panels of a newspaper comic strip, but the panels are presented in random order and need to be placed in order to make a coherent sequence. The Picture Arrangement subtest of the WAIS is sometimes used as an ordering technique.
4. Completion techniques; the subject responds to a "partial" stimulus. For example, the subject may be given the beginning of a story to complete, or a set of sentence stems (e.g., "I am always . . .") to complete. Sentence completion tests are a prime example here.
5. Expressive techniques; the subject engages in some "creative" activity, such as drawing, finger painting, or acting out certain feelings or situations (as in psychodrama). The Draw-A-Person test is a good example.

Controversy. Perhaps more than any other area of testing, projective techniques are a source of controversy and argument. Some psychologists swear by these techniques and some swear at them! Some see them as valuable ways to assess the psychodynamic complexities of the human psyche, and others see them as closely related to superstitious behavior, tea readings, handwriting analysis, astrology, and other systems that lack scientific proof.
Clinical usefulness. There is little doubt that in the hands of a skilled and sensitive clinician, projective techniques can yield useful information, and individual practitioners can utilize these measures to elicit superb psychodynamic portraits of a patient and to make accurate predictions about future behavior. Given this, why is there need for scientific validation? MacFarlane and Tuddenham (1951) provided five basic answers to this question:

1. A social responsibility. Projective tests are misused, and we need to know which types of statements can be supported by the scientific literature and which cannot.
2. A professional responsibility. Errors of interpretation can be reduced and interpretive skills sharpened by having objective validity data.
3. A teaching responsibility. If we cannot communicate the basis for making specific inferences, such as "this type of response to card 6 on the Rorschach typically means that . . . ," then we cannot train future clinicians in these techniques.
4. Advancement of knowledge. Validity data can advance our understanding of personality functioning, psychopathology, etc.
5. A challenge to research skills. As scientists, we ought to be able to make explicit what clinicians use intuitively and implicitly.

Basic assumptions. In general, psychologists believe that behavior is determined or can be explained by specific principles. If we observe a person verbally or physically attacking others, we label the behaviors as aggressive and we seek explanations for the behavior, perhaps postulating "frustration" or looking for childhood developmental explanations or antecedent conditions. With projective tests, the assumption is that specific responses reflect the person's personality and/or psychodynamic functioning. This is based, however, on the questionable assumption that the test protocol presents a sufficiently extensive sampling of the client.

Second, we know that specific behaviors can be strongly influenced by transitory aspects. A person can do well academically in all courses except one, with performance in that course influenced by a dislike for the instructor or some other "chance" factor. Projective tests, however, assume that each and every response is indeed basic and reflective of some major personal themes.

The projective viewpoint further assumes that perception is an active and selective process, and thus what is perceived is influenced not only by the person's current needs and motivation, but by that person's unique history and habitual ways of dealing with the world. The more ambiguous a situation, the more the responses will reflect individual differences in attempting to structure and respond to that situation. Thus, projective tests are seen as ideal miniature situations, where presentation can be controlled and resulting responses carefully observed.

Reliability. Standard reliability procedures, discussed in Chapter 3, are applicable to quantitative scores, such as those obtained on a typical test. With projective techniques, the end result is a protocol, perhaps containing stories or inkblot responses. To assess the reliability of such qualitative data, two general methods are used:

1. Determine the accuracy with which protocols can be matched (the degree to which a judge can match protocols with diagnoses); e.g., of these 50 protocols, which belong to individuals diagnosed with schizophrenia and which belong to controls? Or given test-retest protocols, can we accurately match protocols to the same person? Or can protocols be correctly matched with the interpretations given? Most of these questions can be empirically answered, but they confound the reliability of the test with the reliability of the judgment process used. Thus we may establish that a particular judge is able (or not) to match protocols with diagnoses, but we don't know that the same results apply to other clinicians, nor do we know what specific aspects of the protocol lead to particular judgments.
2. We can therefore select specific aspects of protocols and use rating scales to assess the reliability of such aspects. In Rorschach protocols, for example, we might assess the number of cloud responses, or the number of depressive responses, or the overall "affective tone"; that is, we can go from a very concrete, specific counting procedure to a more abstract and inferential level. This approach also presents many problems. What categories are we to select? Instead of cloud responses, should we score animal responses, kitchen utensils, or – ? Quite possibly, the
categories that might be reliably counted might not be particularly meaningful. If we select the more abstract levels, we again confound the reliability of the test with the reliability of the rater.

In general then, the establishment of reliability is problematic. Test-retest reliability seems incongruent with the notion that whatever projective techniques measure (personality?) changes over time. Alternate-forms reliability is often not applicable because even when there are two forms of a test, the individual stimuli can be quite different. Internal consistency is typically not applicable because it can only be applied to homogeneous items that are directly scorable.

Validity. Validity is even more problematic because most projective tests are designed to assess rather broad domains such as "personality," rather than more specific questions, such as "Is the number of human movement responses on the Rorschach indicative of creative output?" Thus, although specific questions can be and have been studied, often with nonsupportive results, the clinician argues that projective techniques are global or holistic in nature and that such piecemeal validation violates the spirit of such measures.

The Rorschach Inkblot Technique

The Rorschach was developed in 1921 by Hermann Rorschach, a Swiss psychiatrist. Originally, the test consisted of some 15 inkblots that had been created by dropping blobs of ink on pieces of paper and folding these to form somewhat symmetrical figures. Printing was in its infancy then, and only 10 inkblots could be printed because of budgetary constraints (in fact, the publisher went broke after publishing the inkblots with an accompanying monograph). The theoretical background of the Rorschach was Jungian and psychoanalytic, but much of the evidence was empirical. Rorschach had administered inkblots to different patients and people he knew and had retained those whose responses and patterns of responses seemed to be related to personality and psychodynamic functioning. In the United States, the Rorschach became extremely popular due primarily to five psychologists: Samuel Beck, Marguerite Hertz, Bruno Klopfer, Zygmunt Piotrowski, and David Rapaport, each of whom developed a slightly different scoring system. Exner (1974) combined these approaches and developed a "comprehensive system" which is now used quite widely. He united a number of theoretical trends and aspects of the various scoring systems, provided extensive norms, and supplied what is basically a common language. Exner attempted to develop the Rorschach into more of an objective and less of a projective test (Exner, 1999).

Description. The Rorschach consists of 10 symmetrical inkblots, each printed on a separate 6½ × 9½ inch plate. Five of the inkblots are black and gray, and five have other colors, all on white backgrounds. Often the instructions are not standardized but simply require the subject to tell what the blots remind him or her of, or what they might represent, or what they could be (B. Klopfer, Ainsworth, W. G. Klopfer, et al., 1954). The inkblots are presented one at a time, in the same order. This presentation is called the "free association" or response phase. Questions that the subject may have are typically responded to in a noncommittal way (e.g., "however you like"). The inkblots are then presented a second time for the "inquiry" phase. Here the examiner asks whatever questions are necessary to determine what precisely the response was, where in the inkblot it was seen, and why it was seen – i.e., what aspects of the inkblot contributed to the perception. The information obtained here is to be used primarily for scoring the responses. Sometimes a third administration of the inkblots is used (called "testing the limits"), where the examiner explores whether other kinds of responses might be elicited; for example, if the patient originally gave only bizarre responses, could that patient perceive more "normal" aspects?

Scoring. There are five scores that are obtained for each response:

1. Location – Does the response cover the whole of the blot or part of it? If only a part, is that detail large or small, usual or unusual?
2. Determinant – What aspects of the inkblot, at least as perceived by the subject, determined the response? Was it the form of the inkblot, the color, the shading?
3. Content – Each response is classified according to content. For example, did the person see a
human figure, an animal, a geographical concept, food, clouds, man-made objects, . . . ?
4. Popularity or originality (sometimes this is subsumed under content) – Is the response given a popular one or is it original? Popularity can be defined by a list of 10 responses given by B. Klopfer, Ainsworth, W. G. Klopfer, et al. (1954), and originality as a response that appears no more than once in 100 protocols. There are also normative tables that one can use to make these judgments.
5. Form quality or accuracy of percept – Each response can be scored on a rating scale as to the extent to which the response fits the blot, and the extent to which a response is elaborated (for example, a "dog" vs. a "Scottish terrier sitting up begging"). In the Klopfer system the scale goes from −2 to +5.

Once the responses are scored, various percentages are tabulated because the number of responses can vary from person to person. For example, a "typical" protocol will have less than 5% of responses that use uncommon small details. A "typical" schizophrenic protocol will have a higher percentage of these responses, typically about 1 in 5. Once these percentages are tabulated, the psychologist weaves these results into an overall interpretation, combining clinical skill and art with normative data, experience, and internal norms.
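As a concrete illustration of this tabulation step, consider the following minimal sketch in Python. The representation of a protocol as a list of location codes (W = whole blot, D = common detail, Dd = unusual small detail) follows common Rorschach notation, but the sample protocol and the use of the 5% benchmark are purely illustrative.

# Minimal sketch: tabulating location percentages for one protocol.
# The protocol below is invented; real protocols are scored by a clinician.
protocol = ["W", "D", "D", "Dd", "W", "D", "W", "D", "D", "W",
            "D", "W", "D", "D", "W", "D", "D", "W", "D", "D"]

# Percentages are needed because the number of responses varies by person.
n = len(protocol)
pct = {code: 100 * protocol.count(code) / n for code in ("W", "D", "Dd")}
print(pct)  # {'W': 35.0, 'D': 60.0, 'Dd': 5.0}

# By the rule of thumb cited above, a protocol with far more than about
# 5% Dd responses (e.g., 1 in 5) would attract interpretive attention.

The tabulation itself is purely mechanical; as the text notes, it is the interpretation of the resulting percentages that requires clinical skill.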
Approach. Originally, many psychologists regarded the Rorschach as providing an "X-ray" view of the internal psychodynamics of the patient. More recently, the Rorschach has been considered a perceptual-cognitive task that presents ambiguous visual stimuli. The responses are thought to represent the strategies used by the patient in dealing with the world. Note that Rorschach himself did not emphasize the imaginative or content aspects of the responses to the inkblots, but rather focused on the perception aspects (those aspects of the response that reflected the form of the response, whether the entire inkblot was being used or just a small detail). Under the Exner approach, the Rorschach is seen as a problem-solving task and the responses as reflective of styles of coping behavior. That is, how the subject responds to the Rorschach is seen as evidence of how the subject responds to the world, especially to ambiguity and challenge.

Basic issues. The Rorschach is used for multiple purposes, from providing a psychiatric diagnosis to assessing a person's level of creativity. Thus, the issue of validity becomes rather complex – validity for what?

Another major issue is whether the Rorschach is a test, and therefore accountable as to reliability and validity, or a technique or method to generate information about personality functioning that transcends psychometric issues.

Psychometric views. As I. Weiner (1977) indicated, there are three types of opinions regarding the psychometric status of the Rorschach. The first holds that the Rorschach should be discarded as an assessment procedure because it has no demonstrated validity. The second point of view holds that the Rorschach is not a test but a technique whose utility is a function of the examiner's skills and intuition, and therefore the usual psychometric criteria do not apply. A third approach represents something of a middle ground – the Rorschach is a test and its utility needs to be assessed objectively, but the challenge is to develop more sophisticated experimental designs that reflect the complexity of the Rorschach and of the clinical decisions that are made.

I. B. Weiner (1994) argues that the Rorschach is not a test because it does not test anything, i.e., a test is designed to measure whether something is present or not and in what quantity. The Rorschach is a method of generating data that describe personality functioning, and thus it should be called the Rorschach Inkblot Method.

The Rorschach itself can be seen from two validity perspectives. From the viewpoint of criterion validity, the focus is on how specific Rorschach variables or signs are related to actual criteria, such as psychopathology or coping styles. This is the approach of Exner and his colleagues. From the viewpoint of construct validity, the focus is more on the Rorschach as reflective of personality functioning, particularly from a psychoanalytic point of view. This has been the classical or traditional approach. (For an excellent discussion of these issues, see H. Lerner & P. M. Lerner, 1987.)
Perceptual vs. content approaches. One of the major distinctions in the area of assessment is that of nomothetic and idiographic approaches. The nomothetic approach has as its aim the discovery of general laws, while the idiographic approach focuses on individual differences and the uniqueness of the individual (see Chapter 19).

Aronow, Reznikoff, and Moreland (1995) argue that there is another distinction to be kept in mind when dealing with the Rorschach – perceptual vs. content approaches. Those who favor the perceptual approach emphasize aspects of how the subject perceives, such as location, form level, and determinants. Those who favor the content approach emphasize what the subject perceives. These two dimensions can result in four "theoretical" stances: (1) the perceptual nomothetic; (2) the content nomothetic; (3) the content idiographic; and (4) the perceptual idiographic.

The perceptual nomothetic approach is how Rorschach began. He felt that the scoring of the determinants was most important, and the scoring of content, the least important. Subsequently, however, Rorschach shifted his orientation to a more content-oriented and psychoanalytic approach that focused on projection rather than perception. Most scoring systems of the Rorschach taught in American graduate schools emphasize this perceptual nomothetic approach, and the Exner system would fall here also.

The content nomothetic approach focuses on the categories of content to be scored, such as human responses, animal responses, etc. A number of content scales have been developed to measure a variety of variables such as depression and primary process thinking. According to Aronow and Reznikoff (1976), these content scales have shown better psychometric properties than perceptual scores, but because of inadequate reliability are not recommended for use with individual subjects in clinical settings.

The content idiographic approach focuses on the content, the responses of the individual, and what these responses indicate about this particular individual. The approach here is a psychodynamic one that focuses on the unconscious aspects of mental functioning. From this approach, the very strength of the Rorschach is the unlimited freedom the subject has to respond, or not, as well as the ambiguous nature of the inkblots and the visual aspects of the task. The fourth possibility, the perceptual idiographic approach, has never come about (Aronow, Reznikoff, & Moreland, 1995).

Faking. Still another issue is the susceptibility of the Rorschach to faking. In a classic study done by Albert, Fox, and Kahn (1980), Rorschach protocols were given to experts to judge. Some of the protocols came from paranoid schizophrenic patients, some were from normal individuals, and some were from college students who had been instructed, to varying degrees, to fake the responses as if they were paranoid schizophrenics. More of the protocols from the faking subjects were identified as psychotic (from 46% to 72%); 48% of the schizophrenic protocols were diagnosed as psychotic, as well as 24% of the normal protocols. Although in the usual application of the Rorschach diagnostic status is not the aim, these findings are somewhat disturbing.

Limitations. From a psychometric point of view, the Rorschach leaves much to be desired. There are only 10 items (inkblots), so its reliability is inherently limited. Different individuals can and do give differing numbers of responses, so that conclusions and interpretations are made using different databases. There is only one form of the Rorschach, so alternate-form reliability cannot be computed, and pre- and post-studies are limited to using the same form.

Reliability and validity. When there are many studies on one topic and their results seem conflicting – as occurs with the Rorschach – a useful procedure is to analyze overall trends through a meta-analysis. Parker (1983) conducted such an analysis on 39 studies that had been published between 1971 and 1980 in one journal and that met certain experimental criteria. He reported that reliabilities in the order of .83 and higher, and validity coefficients of .45 and higher, can be expected with the Rorschach when hypotheses supported by empirical or theoretical rationales are tested using reasonably powerful statistics (for an example of a specific study, see Bornstein, Hill, Robinson, et al., 1996). Hiller, Rosenthal, Bornstein, et al. (1999) concluded that the validity of the Rorschach is not significantly different from that of the MMPI, with
both instruments showing mean validity coefficients of about .29 to .30. An opposite conclusion, however, was reached by Garb, Florio, and Grove (1998) – that the Rorschach is not as valid as the MMPI.
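For readers curious about the mechanics behind such meta-analytic summaries, the sketch below shows one common way of averaging correlation coefficients across studies: transform each r to Fisher's z, average with weights based on sample size, and transform back. This is a generic illustration with invented numbers, not the actual procedure or data of Parker (1983) or Hiller et al. (1999).

import math

# Illustrative meta-analytic averaging of validity coefficients.
# Each tuple is (r, sample size); the values are hypothetical.
studies = [(0.25, 60), (0.35, 120), (0.28, 45)]

# Fisher's z transformation is z = arctanh(r); each study is weighted n - 3.
num = sum((n - 3) * math.atanh(r) for r, n in studies)
den = sum(n - 3 for _, n in studies)
mean_r = math.tanh(num / den)
print(round(mean_r, 2))  # 0.31, the weighted mean validity coefficient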
Percentage of agreement. Percentage of agreement is a poor measure of reliability. If a test has perfect reliability, we would expect 100% agreement among scorers. If the test has zero reliability, however, the lowest chance level of agreement is not zero, but 50%. If, for example, the judgment is required in a yes-no format (e.g., Is this response an aggressive response? Does this Rorschach protocol show evidence of psychotic thinking?), with two raters giving completely blind guesses, we have four possible results: one rater says "yes" and the other "no," and vice versa; both say "yes"; or both say "no." With four possibilities, two of them – or 50% – represent agreement (regardless of correctness). Furthermore, if either or both raters make a particular judgment more often than 50% of the time, the rate of agreement will increase. Suppose, for example, that two Rorschach raters each have a tendency to label 80% of the responses as aggressive in nature; their rate of chance agreement will actually be about 68%, a rather respectable figure, although in this case it would reflect biased blind guessing. Wood, Nezworski, and Stejskal (1996) soundly criticize the Exner system on this basis, and show that even a percentage agreement of 92% could reflect no agreement on the part of any two raters.

Grønnerød (1999) computed percentage agreement, correlations, and Kappa coefficients, and found Kappa to be "conservative and reliable"; this author recommends Kappa as a standard estimate of interrater agreement (see his article for the formulas to compute Kappa).
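The arithmetic behind these figures is easily made explicit. The sketch below computes the chance agreement for the 80% example and then applies Cohen's kappa, the most widely used member of the kappa family (the specific coefficients Grønnerød evaluated may differ in detail).

# Two raters who each say "yes" with probability .80 will agree by
# chance alone whenever both say "yes" or both say "no":
p_yes = 0.80
p_chance = p_yes * p_yes + (1 - p_yes) * (1 - p_yes)
print(round(p_chance, 2))  # 0.68 -- agreement from biased blind guessing

# Cohen's kappa corrects observed agreement for chance agreement:
#   kappa = (p_observed - p_chance) / (1 - p_chance)
def cohen_kappa(p_observed, p_chance):
    return (p_observed - p_chance) / (1 - p_chance)

print(round(cohen_kappa(0.92, 0.68), 2))  # 0.75
print(round(cohen_kappa(0.68, 0.68), 2))  # 0.0 -- nothing beyond chance

As the last line shows, an apparently respectable 68% agreement is worth a kappa of zero when chance agreement is itself 68%.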
Representative study. A representative study using the Exner Comprehensive System is that by Haller and Exner (1985). They administered the Rorschach to 50 patients, mostly women, with complaints and symptoms of depression. Three to four days later they were retested. At retest, one half of the sample were given special instructions to remember what responses they had given the first time, and to give different responses. Thirty percent of the protocols were scored by two psychologists, with percentage of interscorer agreement ranging from 88% to 100%. Experimental group subjects (with special instructions) repeated about one third of their responses, while control subjects (standard instructions) repeated about two thirds of their responses.

The protocols were scored on 27 different variables, with only 5 variables showing significant changes from test to retest. Most of the test-retest correlations were above .70, even for the experimental group, which was instructed to give different responses.

G. J. Meyer (1997) concluded that Exner's Comprehensive System has excellent interrater reliability, with a mean coefficient of .86, but Wood, Nezworski, and Stejskal (1997) strongly disagreed.

The Holtzman Inkblot Technique (HIT)

The HIT (Holtzman, Thorpe, & Swartz, 1961) was designed to overcome most of the psychometric limitations of the Rorschach. The HIT consists of two sets of 45 inkblot cards (Forms A and B), and the subject is required to give only one response per card. Exact wording of instructions is not given, to avoid "stiffness" and maintain rapport, but three points should be made: (1) the inkblots are not made to represent something in particular; (2) different people see different things; and (3) only one response is required per card. Following each response, a brief inquiry is made to determine the location of the percept and the determinants, and a general question is asked to encourage elaboration (what else can you tell me?).

Development. Literally hundreds of inkblots were created as an item pool. Blots were then selected on the basis of a number of principles having to do with such aspects as shading and color, and the degree to which they elicited different responses. Eventually, three sets of 45 inkblots each were developed and administered to samples of college students and psychotic patients. The responses were then analyzed, ultimately on six dimensions, and inkblots were retained both on the basis of whether responses discriminated between the two samples and on other considerations such as reliability.

Scoring. Each response is scored on 22 variables, each variable carefully defined, using a numerical
scale. These variables range from reaction time (the time between presentation of the inkblot and the beginning of the response), to a 3-point location score (0 for whole blot, 1 for large areas, and 2 for smaller areas), to "pathognomic verbalization" (i.e., deviant autistic responses, such as "a person with the head of a chicken").

Reliability. Four major types of reliability studies were carried out by Holtzman and his colleagues: (1) studies of intra-scorer consistency (the degree of agreement when the same protocols are scored twice by the same individual); (2) inter-scorer reliability (the degree of agreement between two or more different scorers); (3) intra-subject stability (the degree of internal consistency); and (4) delayed intra-subject stability (the degree of agreement between alternate forms, with varying test-retest time intervals).
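Each of these coefficients is, at bottom, a correlation. As a minimal numerical illustration, the Python sketch below computes an inter-scorer reliability coefficient as the Pearson correlation between two scorers' totals for the same ten protocols; the scores are invented.

# Inter-scorer reliability as a Pearson correlation between two scorers'
# totals for the same protocols (hypothetical data).
scorer_a = [12, 15, 9, 22, 18, 11, 25, 14, 19, 16]
scorer_b = [13, 14, 10, 21, 17, 12, 24, 15, 18, 18]

n = len(scorer_a)
mean_a, mean_b = sum(scorer_a) / n, sum(scorer_b) / n
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scorer_a, scorer_b))
var_a = sum((a - mean_a) ** 2 for a in scorer_a)
var_b = sum((b - mean_b) ** 2 for b in scorer_b)
r = cov / (var_a * var_b) ** 0.5
print(round(r, 2))  # 0.98 -- the kind of high coefficient reported below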
These studies were carried out on a wide variety of samples, from 5-year-olds to depressed patients. In general, intra-scorer reliability coefficients were high, typically in the mid .90s. Inter-scorer reliability was also high, with coefficients in the .80s and .90s. Internal consistency also was high, with typical coefficients in the high .80s and low .90s; some coefficients were substantially lower, but these typically were on variables where the scores were not normally distributed (for example, the variable of "space," responses based on the white space, which are infrequent). Finally, test-retest studies showed close comparability of the two forms, with most variables showing stability over time.

Validity. There are hundreds of studies that have looked at the validity of the HIT, typically by comparing diagnostic groups, such as psychotic vs. normal, or by correlating specific HIT variables with personality test scores. In general, the results support the validity of the HIT, although the results are typically modest. Although the HIT has been useful in research projects, it has not caught on with clinicians. Those clinicians who are likely to use an inkblot test tend to select the Rorschach.

The Thematic Apperception Test (TAT)

When we read a story, we not only learn about the fictitious characters but also about the author. The personality of a Truman Capote is distinctly different from that of a Charles Dickens, and one need not have a doctorate in literature to perceive the major differences between these two authors from their writings. It was this type of observation that led Murray and Morgan to develop the TAT, in which the respondent is asked to make up stories in response to a set of pictures. Like the Rorschach, the TAT is used extensively and also has received a great deal of criticism. The TAT was introduced in 1935 and consists of a series of 31 pictures, most of which are relatively ambiguous. The subject is shown a picture and asked to make up a story that reflects what is going on, what has happened, what will happen, and the feelings of the various characters depicted. The resulting stories are assumed to reflect the person's needs, emotions, conflicts, etc., at both the conscious and unconscious levels.

Variations. Many variants of the TAT approach have been developed, including sets of cards that depict animal characters for use with children (e.g., the Children's Apperception Test, Bellak, 1975), a specific animal character, the dog Blacky, in situations depicting crucial psychoanalytic concepts such as castration anxiety (the Blacky Pictures Test, G. S. Blum, 1950), sets for use with the elderly (the Gerontological Apperception Test, R. L. Wolk & R. B. Wolk, 1971), with families (Julian, Sotile, Henry, et al., 1991), with specific ethnic or cultural groups (Bellak, 1986), and sets developed according to standard psychometric procedures (e.g., the Picture Projective Test, Ritzler, Sharkey, & Chudy, 1980; and the Roberts Apperception Test for Children, McArthur & Roberts, 1982).

Description. Twenty of the TAT cards are designated as appropriate for either boys or girls, or for adult males or females; eleven of the cards have no designation, with one of these being a blank card. (See Morgan, 1995, for a detailed description of each of these pictures and their historical origins.) It was intended that 20 cards be selected for a particular subject, 10 of which would be appropriate for the person's age and gender. Typically, most clinicians use somewhere between 6 and 10 cards, selected on the basis of the clinician's judgment that the card will elicit thematic information
related to the client's functioning, or sometimes on the basis of published recommendations (e.g., Arnold, 1962; see A. A. Hartman, 1970, for a popularity ranking of TAT cards).

The pictures are quite varied. In one, for example, which is traditionally used as the first card, there is a seated young boy contemplating a violin that rests on a table in front of him. Another picture shows four men in overalls lying on a patch of grass; still another is of a man clutched from behind by three hands.

Administration. Although theoretically the TAT could be used with children, it is typically used with adolescents and adults. The original manual (H. A. Murray, 1943) does have standardized instructions, but typically examiners use their own versions. What is necessary is that the instructions include the points that: (1) the client is to make up an imaginative or dramatic story; (2) the story is to include what is happening, what led to what is happening, and what will happen; and (3) it should include what the story characters are feeling and thinking.

As part of the administration, the examiner unobtrusively records the response latency for each card, i.e., how long it takes the subject to begin a story. The examiner writes down the story as accurately as possible, noting any other responses (such as nervous laughter, facial expressions, etc.). Some examiners use a tape recorder, but such a device may significantly alter the test situation (R. M. Ryan, 1987).

Often, after all the stories have been elicited, there is an inquiry phase, where the examiner may attempt to obtain additional information about the stories the client has given. A variety of techniques are used by different examiners, including asking the client to identify the least preferred and most preferred cards.

Pull of TAT cards. TAT cards elicit "typical" responses from many subjects, somewhat like the popular responses on the Rorschach. This is called the "pull" of the card (e.g., Eron, 1950), and some have argued that this pull is the most important determinant of a TAT response (Murstein, 1963). Many of the TAT cards are from woodcuts and other art media, with lots of shading and dark, sometimes indistinguishable details. Because of this stimulus pull, many of the cards elicit stories that are gloomy or melancholic (Goldfried & Zax, 1965). There is some evidence to suggest that the actual TAT card may be more important than the respondent's "projections" in determining the actual emotional tone of the story (e.g., Eron, Terry, & Callahan, 1950).

Scoring. H. A. Murray (1938) developed the TAT in the context of a personality theory that saw behavior as the result of psychobiological and environmental aspects. Thus, not only are there needs that a person has (both biological needs, such as the need for food, and psychological ones, such as the need to achieve or the need for control), but there are also forces in the environment, called press, that can affect the individual. Presumably, the stories given by the individual reflect the combination of such needs and presses, both in an objective sense and as perceived by the person.

In most stories there is a central figure, called the hero, and it is assumed that the client identifies psychologically with this hero. Both the needs and the presses are then identified, and each is scored on a 1- to 5-point scale in terms of intensity and centrality of expression. Murray (1938) originally identified some 36 needs, but others have reduced or increased this list. Following Murray's example, there were a number of attempts to develop comprehensive scoring systems for the TAT. A number of manuals are available that can be used (e.g., W. E. Henry, 1956; M. I. Stein, 1981), although none have become the standard way, and ultimately the scoring reflects the examiner's clinical skills and theoretical perspective.

A number of scoring procedures have been developed for these stories (e.g., Bellak, 1986; Shneidman, 1951), but typically in clinical settings, as opposed to research studies, the interpretation is based not on quantitative analysis but on a qualitative assessment, often couched in psychodynamic theory. Analysis of TAT protocols is often impressionistic – a subjective, intuitive approach where the TAT protocol is perused for such things as repetitive themes, conflicts, slips of the tongue, degree of emotional control, sequence of stories, etc. As with the Rorschach, the interpretation is not to be done blindly but in accord with other information derived from interviews with the client, other test results, etc.
In effect, then, the utility of the TAT is, in large part, a function of both the specific scoring procedure used and the talent and sensitivity of the individual clinician.

Many specific scoring guidelines have also been developed that focus on the measurement of a specific dimension, such as gender identity (R. May, 1966) or achievement motivation (McClelland, Atkinson, Clark, et al., 1953). A recent example is a scoring system designed to measure how people are likely to resolve personal problems; for each card a total score as well as four subscale scores are obtained, and these are aggregated across cards (Ronan, Colavito, & Hammontree, 1993).
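The aggregation itself is mechanically simple, as the following Python sketch illustrates; the subscale names and the ratings are hypothetical and are not those of the Ronan, Colavito, and Hammontree (1993) system.

# Illustrative aggregation of per-card ratings into subscale and total
# scores. Each dictionary holds one card's ratings on four invented
# subscales; real systems define their own dimensions and rating anchors.
cards = [
    {"problem": 3, "plan": 2, "action": 4, "outcome": 3},
    {"problem": 4, "plan": 3, "action": 2, "outcome": 2},
    {"problem": 2, "plan": 2, "action": 3, "outcome": 4},
]

subscales = {key: sum(card[key] for card in cards) for key in cards[0]}
total = sum(subscales.values())
print(subscales)  # {'problem': 9, 'plan': 7, 'action': 9, 'outcome': 9}
print(total)      # 34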
What does the TAT measure? First and foremost, TAT stories are samples of the subject's verbal behavior. Thus, they can be used to assess the person's intellectual competence, verbal fluency, capacity to think abstractly, and other cognitive aspects. Second, the TAT represents an ambiguous situation presented by an "authority" figure, to which the subject must somehow respond. Thus some insight can be gained about the person's coping resources, interpersonal skills, and so on. Finally, the TAT responses can be assumed to reflect the individual's psychological functioning, and the needs, conflicts, feelings, etc., expressed in the stories are presumed to reflect the client's perception of the world and inner psychodynamic functioning.

TAT stories are said to yield information about the person's (1) thought organization, (2) emotional responsiveness, (3) psychological needs, (4) view of the world, (5) interpersonal relationships, (6) self-concept, and (7) coping patterns (Holt, 1951).

Holt pointed out that the responses to the TAT not only are potentially reflective of a person's unconscious functioning, in a manner parallel to dreams, but that there are a number of "determinants" that impact upon the responses obtained. For example, the situational context is very important. Whether a subject is being evaluated as part of court-mandated proceedings or whether the person is an introductory psychology volunteer can make a substantial difference. The "directing set" is also important, i.e., the preconceptions that the person has of what the test, tester, and testing situations are like.

Research uses. The TAT has also been used for research purposes, with perhaps the best-known example being its use as a measure of need achievement (McClelland, Atkinson, Clark, et al., 1953; see Heckhausen, 1967, for a review).

Manuals. There is an extremely large body of literature on the TAT, not just in terms of journal articles, but also scoring manuals, chapters in books on projective techniques, entire books on the TAT, and critical reviews (see Varble, 1971).

Reliability. The determination of the reliability (and validity) of the TAT is a rather complex matter because we must ask which scoring system is being used, which variables are scored, and perhaps even what aspects of specific examinees and examiners are involved.

Eron (1955) pointed out that the TAT was a research tool, one of many techniques used to study the fantasy of normal individuals, but that it was quickly adopted for use in the clinic without any serious test of the reliability and validity of the many methods of analysis that were proposed. He pointed out that there are as many ways of analyzing TAT stories as there are practitioners, and that few of these methods have been demonstrated to be reliable.

Some would argue that the concept of reliability is meaningless when applied to projective techniques. Even if we don't accept that argument, it is clear that the standard methods of determining reliability are not particularly applicable to the TAT. Each of the TAT cards is unique, so neither split-half nor parallel-form reliability is appropriate. Test-retest reliability is also limited because, on the one hand, the test should be sensitive to changes over time, and on the other, the subject may focus on different aspects of the stimulus from one time to another.

The determination of reliability also assumes that extraneous sources of variation are held in check, i.e., the test is standardized. This is clearly not the case with the TAT, where instructions, sequence of cards, scoring procedure, etc., can vary.

Validity. Validity is also a very complex issue, with studies that support the validity of the TAT and studies that do not. Varble (1971) reviewed this issue and indicated that: (1) the TAT is not
well suited or useful for differential diagnosis; (2) the TAT can be useful in the identification of personality variables, although there are studies that support this conclusion and studies that do not; and (3) different reviewers come to different conclusions, ranging from "the validity of the TAT is practically nil" to "there is impressive evidence for its validity."

Holt (1951) pointed out that the TAT is not a test in the same sense that an intelligence scale is, but that the TAT really reflects a segment of human behavior that can be analyzed in many ways. One might as well ask what is the reliability and validity of everyday behavior. It is interesting to note that Bellak's (1986) book on the TAT, which is quite comprehensive and often used as a training manual, does not list either reliability or validity in its index. But the TAT continues to be of interest to both practitioners and researchers (e.g., Cramer, 1999).

Gender differences. A. J. Stewart and Chester (1982) reviewed a number of studies of gender differences on the TAT and concluded that gender differences were "inconclusive," in part because many of the studies administered different cards to male subjects than to female subjects. Worchel, Aaron, and Yates (1990) administered both male and female cards to both male and female subjects. Their analyses indicated that female TAT cards elicited more responses on a "general concerns" dimension, and that female subjects gave more responses on an "interpersonal relations" scale. Thus, they obtained both a gender difference between types of TAT cards and between subjects of different gender.

TAT in an applied setting. An interesting application of the TAT can be found in the study of McClelland and Boyatzis (1982), who studied the TAT protocols of 237 managers at the American Telephone and Telegraph Company. The TAT had been administered to them when they joined the company, and the investigators correlated various TAT variables with the levels of promotion attained after 8 and 16 years. A leadership motive pattern, defined as a moderately high need for power, a lower need for affiliation, and a high degree of self-control, was significantly associated with managerial success after 8 and 16 years for nontechnical managers. Need for achievement was also associated with success, but only at the lower managerial levels. None of the variables was associated with success for technical managers who had engineering responsibilities.

TAT in a research setting. K. L. Cooper and Gutmann (1987) used five TAT cards to assess pre- and post-empty-nest women (women whose children were still living at home vs. women whose children had left). The results were in line with the theoretical expectations of the authors in that the TAT stories of post-empty-nest women showed more active "ego mastery."

Are instructions the key? The TAT is the focus of much controversy, with advocates finding the test quite useful and critics pointing out the lack of reliability and validity. Lundy (1988) suggested that these divergent views may be a reflection of the way the test is administered. He administered four TAT cards to samples of adolescents using one of four instructional sets: (1) neutral, standard instructions; (2) following a personality test; (3) with emphasis that the TAT is a personality test; and (4) with nonthreatening but structured instructions. Comparisons with various criteria indicated that the stories written after the neutral instructions were valid predictors of the three need dimensions that were scored, but with the other instructional sets nonsignificant results were obtained.

Sentence Completion Tests

A sentence completion test consists of a number of incomplete sentences where only the stem is presented; the client is asked to complete each sentence, typically with the first response that comes to mind. Sometimes stems are selected for their potential response value (e.g., "My father . . ."), and sometimes they are quite open-ended (e.g., "I always . . ."). Presumably, the responses reflect the person's personality, psychodynamic functioning, motivation, conflicts, degree of adjustment, and so on. As with most other projective techniques, the results are interpreted by an impressionistic, subjective, intuitive approach, and less usually by scoring the completions and using the obtained scores in a normative fashion.
Many sentence completion tests are available, and it is not unusual for individual clinicians or agencies to use their own form, which typically lacks standardization and psychometric data as to reliability and validity, as well as appropriate norms. Over the years, a number of sentence completion tests have also been published commercially, some for use with specific samples such as older persons, and some to assess specific variables such as self-concept, attitudes toward authority figures, etc.

Although no one sentence completion test dominates the market, among the best known currently in use are the Washington University Sentence Completion Test (Loevinger, 1976; 1979), which assesses ego development and is primarily a research tool, and the Rotter Incomplete Sentences Blank (Rotter, 1946; Rotter & Rafferty, 1950), which is a measure of adjustment often used in clinical settings such as college counseling centers (Holaday, Smith, & Sherry, 2000).

The Rotter contains 40 sentence stems, with 3 forms: for high school, college, and adults. The test can be easily administered, has no time limit, can usually be completed in less than 30 minutes, and can be administered individually or in a group setting. It probably takes longer to score the test than to administer it! Each response is scored in a two-step sequence. First, the response is categorized as: omission (i.e., no response), conflict (indicative of hostility), positive, or neutral (e.g., "I . . . am answering this test"). Then positive responses are scored 0, 1, or 2 and conflict responses are scored 4, 5, or 6, using explicit definitions and examples given in the test manual. Total scores can vary from 0 to 240, with the mean score reported to be 127 (SD of 14).
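As a rough illustration of this two-step sequence, consider the Python sketch below. The mapping shown – positive responses at 0 to 2, conflict responses at 4 to 6 – follows the description above; the assignment of the midpoint value of 3 to neutral responses and the sample protocol are assumptions made for the example, not the manual's actual rules.

# Illustrative two-step Rotter-style scoring. Step 1 assigns a category;
# step 2 converts the category (plus a 1-3 intensity level) to points.
# Neutral responses are assumed here to take the midpoint value of 3;
# omissions are simply skipped rather than prorated.
def score_response(category, level=None):
    if category == "neutral":
        return 3
    if category == "positive":   # more positive -> lower (healthier) score
        return {1: 2, 2: 1, 3: 0}[level]
    if category == "conflict":   # stronger conflict -> higher score
        return {1: 4, 2: 5, 3: 6}[level]
    return None                  # omission: not scored in this sketch

# A hypothetical protocol of (category, level) pairs, one per stem:
protocol = [("neutral", None), ("positive", 2), ("conflict", 3), ("positive", 1)]
scores = [score_response(c, lv) for c, lv in protocol]
print(scores, sum(s for s in scores if s is not None))  # [3, 1, 6, 2] 12

With 40 stems each scored 0 to 6, totals can indeed range from 0 to 240, higher totals indicating poorer adjustment.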
The reliability of this scoring system is quite good, with interscorer reliability coefficients in the .90s. Split-half reliability is also quite good, with typical coefficients in the .80s. Validity, however, is somewhat questionable. Although there are studies that used the Rotter as a screening device to identify delinquent youths, for example, the number of false negatives (delinquent youths not identified as delinquents) and the number of false positives (nondelinquent youths "identified" as delinquents) can be quite high. This indicates that the test cannot be used for individual clients (Cosden, 1987). Correlations between scores on the Rotter and various indices of adjustment are typically in the .20 to .50 range, modest at best.

Normative data on the Rotter are based on a medium-sized sample (about 300) of college students, and are probably no longer current; by 1981, significantly different mean scores were being reported (Lah & Rotter, 1981).

Clinicians who use tests such as the Rotter see its value as allowing the client to respond more freely than if the same questions were asked in an interview; thus, some see the Rotter as a semistructured interview that gives the client more freedom to indicate, at least at a surface level, some of the conflicts they experience and how they perceive aspects of the world. Other clinicians see sentence completion tests as good starting points; they can ask the client "what did you mean by item 6" or "tell me more about item 12" and have the responses be the basis for a productive therapeutic session. At the same time, there is recognition, supported by research, that the responses to sentence completion tests are less "deep" than ones elicited by the Rorschach or other less structured projective tests. Some investigators have argued that responses to the Rotter are heavily affected by social desirability, but others have not found support for this (e.g., Janda & Galbraith, 1973; McCarthy & Rafferty, 1971). For a review of some of the basic issues involved in sentence completion tests, see Goldberg (1965). See also the Miner Sentence Completion Scale, intended to measure the "motivation to manage"; this scale also appears useful in studies of employee selection (Carson & Gilliard, 1993; Miner, 1964).

Drawings

There are a number of projective tests that request the client to draw something, for example, a person (Machover, 1949), or a house, a tree, and a person (Buck, 1966), their own family (Hammer, 1978), and so on. Some of these procedures evolved out of attempts to measure intelligence from a developmental perspective, using stimuli that could be applied cross-culturally with a minimum of verbal skills required either to understand the instructions or respond to the task. Eventually these procedures were expanded to assess personality, especially within a psychodynamic framework.
Despite the fact that such procedures are quite popular and often used by clinicians as part of a battery of tests, the experimental and psychometric literature does not, in general, support their reliability and validity as measures of personality and/or psychopathology. A typical reliability example is the study by S. Fisher and R. Fisher (1950), who evaluated the drawings of 32 paranoid schizophrenics; they found that the interrater reliability of trained psychologists was no better than that of untrained raters, and that for both the interrater reliability was quite poor. Validity studies fare no better, with many studies reporting inconclusive or negative findings. Yet these techniques continue to be popular, and attempts are made to integrate research findings with clinical practice (e.g., Riethmiller & Handler, 1997).

Draw-A-Man Test and the DAP. Originally developed by Goodenough (1926), this was a simple test based on even simpler assumptions. The child was required to draw a man (rather than a woman, because women's clothing was much more variable and thus difficult to quantify). It was assumed that children draw what they know rather than what they see, and that up to age 10 the quality of a child's drawing reflected intellectual development. Given the historical context, Goodenough's efforts were indeed good enough – she collected a normative sample of almost 4,000 children and used standardized procedures. She developed a scoring scale that could be used to assess the child's drawing in terms of cognitive development, and so an IQ could actually be computed.

A revision of this test came about in 1963 (the Goodenough-Harris Draw-A-Person Test (DAP); D. B. Harris, 1963), where a drawing of a woman was added and an attempt was made to extend the scale to the adolescent years. In 1968, Koppitz developed a scoring system that became very popular, and in 1988, Naglieri presented a revision and update in which three drawings (a man, a woman, and oneself) are produced and scored on 64 aspects that include the number of body parts, their location, and their proportion. The result of this quantitative scoring system is a standardized score for each of the drawings as well as a total test score, with a mean of 100 and SD of 15, just like most intelligence tests. The Naglieri scoring system was normed on more than 2,600 individuals aged 5 through 17, representative of U.S. Census data on a variety of variables.

With these newer revisions, reliability is generally satisfactory although low: internal consistency coefficients range from .56 to .78 (median of .70) for each of the drawings, with a median of .86 for the total test score; test-retest coefficients range from .60 to .89; and interrater reliability is typically in the mid .90s (Kamphaus & Pleiss, 1991). Validity correlations are substantially lower, with a median correlation coefficient of .57 between the draw-a-person and standard measures of intelligence. Wisniewski and Naglieri (1989) obtained a correlation of .51 between DAP total scores and WISC-R Full Scale IQ in a sample of 51 school children; their mean IQ on the WISC-R was 99.5 and on the DAP it was 95.2. Thus, from the perspective of intelligence or cognitive development, the DAP can be a legitimate part of the clinician's repertoire of instruments. It is nonthreatening and can be a good "conversation opener" with a young client. Psychometric difficulties arise, however, when the test (or a variant) is used as a projective technique to elicit information about the client's functioning.

Gender-role identification. The DAP is frequently used as a measure of gender-role identification. That is, it is assumed that the "normal" response to the instructions of "draw a person" is to draw a figure of one's own gender. Most of the time this is a reasonable assumption supported by the research literature. However, opposite-gender drawings are frequently obtained from women as well as from young school-aged boys. Numerous explanations have been proposed, including the notion that women in our culture are ambivalent about their gender identification. Farylo and Paludi (1985) point out a number of methodological problems, including the observation that masculinity-femininity is often assumed to be a bipolar dimension (vs. the notion of several types, including androgynous and undifferentiated individuals), that the gender of the administrator has an impact, and that in our culture "draw a person" may well be equated with "draw a man."

The DAP has also found substantial use in studies of attitudes toward target groups such as
dentists (S. Phillips, 1980), scientists (Chambers, 1983), and computer users (Barba & Mason, 1992).

The Bender Visual Motor Gestalt Test

Introduction. As with drawings, the Bender-Gestalt is used for two distinct purposes with adult clients: as a measure of neuropsychological functioning and as a projective test to assess personality functioning. From a psychometric point of view, the evidence supports its use as a measure of neuropsychological functioning, but substantially less so as a projective device. With children, it is also used for two purposes: as a test of visual-motor development and as a projective personality technique. There is considerable evidence to support the validity of the Bender-Gestalt as a measure of visual-motor development (Tolor & Brannigan, 1980), but less so for its use to assess children's affective or behavioral disorders (Dana, Feild, & Bolton, 1983). Visual-motor development parallels various aspects of intelligence such as memory, visual perception, spatial aspects, and so on, so the test is potentially an indicator of the intellectual status of a child from a developmental perspective.

The Bender-Gestalt is one of the most widely used tests; in fact, in a 1969 survey, it was used more frequently than the WAIS or the Rorschach (Lubin, Wallis, & Paine, 1971). Why this popularity? There are probably four major reasons: (1) the test is easy to administer and relatively brief; (2) the Bender-Gestalt has, as mentioned above, two distinct purposes; (3) the test can be useful as a screening device to determine if more in-depth assessment is needed; and (4) the clinician can obtain considerable information from the test situation by carefully observing the client's behavior (see Piotrowski, 1995).

Development. The Bender-Gestalt was developed in 1938 to investigate how concepts from Gestalt psychology applied to the study of personality and brain injury, and was initially administered to many different clinical samples such as mentally retarded and brain-damaged individuals (Bender, 1938). However, there was no standard scoring procedure presented, nor any systematic data analysis of the results.

Description. The test consists of nine geometric designs, originally developed to illustrate the tendency of the perceptual system to organize visual stimuli into whole figures or "gestalts." Each design is on an individual card and presented one at a time; the subject is asked to copy it on a piece of paper. Bender (1938) believed that the quality of the reproduction of the designs varied according to the level of motivation of the subject and according to the pathological state of the subject, such a state being "organic" (i.e., the result of damage to the central nervous system) or "functional" (i.e., no actual brain damage, but the abnormal behavior serving some psychological purpose). The test was originally used as a clinical and research instrument with adults, but in the 1950s it began to be used with children and was presumed to measure visual-motor integration.

Administration. Although there are no standard directions, various authors have provided guidelines (e.g., Lezak, 1983). For example, the client is ordinarily told that there will be nine such cards. The test is considered appropriate for all age groups, from 3 years through adult, and can be administered on an individual or group basis (e.g., Keogh & Smith, 1967; Siebel, W. L. Faust, & M. S. Faust, 1971). Sometimes the Bender-Gestalt is used as a "warm up" test before administering an entire battery, because it is fairly simple and nonthreatening.

Scoring. A variety of scoring systems have been developed for the Bender-Gestalt, primarily concerned with the accuracy and organization of the drawings. Some of the aspects that are considered are the relative size of the drawings compared to the stimuli, their location on the sheet of paper, and any inversions or rotations of the figure. The Pascal and Suttell (1951) scoring system for adult subjects is one of the better known and includes 106 different scorable features of the drawings, where each abnormal response is given a numerical value. Typical test-retest reliabilities over a short time period of 24 hours are in the .70s, and interscorer reliability of trained scorers is typically in the .90s.

Another popular scoring system, this one for the protocols of children, was developed by Koppitz (1964; 1975). Koppitz identified a group
Another popular scoring system, this one for the protocols of children, was developed by Koppitz (1964; 1975). Koppitz identified a group of 13 “pathognomonic” signs that she felt were indicators of emotional disturbance in children, each of these signs being relatively independent of the child’s visual-motor functioning. Some of these indicators include the use of dashes rather than dots and the size of the drawing in comparison to the actual figure. The presence of three or more of these signs in a child’s protocol indicated serious emotional disturbance.
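The decision rule itself is trivially mechanical, as a minimal sketch in Python shows; the indicator names used here are illustrative, not Koppitz’s actual signs.

```python
# Koppitz flagging rule: three or more of the 13 emotional indicators in a
# child's protocol suggest serious emotional disturbance.
def koppitz_flag(signs_present, threshold=3):
    """signs_present: set of emotional-indicator names scored as present."""
    return len(signs_present) >= threshold

# Hypothetical protocol with two indicators present:
protocol = {"dashes instead of dots", "drawing much larger than stimulus"}
print(koppitz_flag(protocol))  # -> False: below the three-sign threshold
```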
The Koppitz scoring system takes into account that the number of errors decreases rapidly between ages 5 and 8 and that the decrease levels off between ages 9 and 11, and so there is some dispute as to whether the scoring system is applicable to children beyond age 8 or 10 (R. L. Taylor, Kauffman, & Partenio, 1984). There is some support, however, for the validity of the 13 pathognomonic signs. Rossini and Kaspar (1987), for example, studied three groups of children, 40 in each group, aged 7 to 10: presumably normal children, children with mild problems, and children with chronic psychological problems. All Bender-Gestalt protocols were scored by an experienced clinician, and 32 were independently scored by a second clinician. The interscorer reliability was reported to be .92, and both groups of problem children produced significantly more emotional indicators than the normal children, although the two problem groups could not be differentiated from each other. An analysis of the 13 specific indicators showed that only 3 were significantly related to psychopathology. Hutt (1977) developed a scoring system to measure severity of psychopathology, and several other modifications and scoring systems are available (e.g., Brannigan, Aabye, Baker, et al., 1995; E. O. Watkins, 1976). These systems seem to have adequate test-retest reliability and high interscorer reliability with trained scorers. For example, for the Koppitz system interscorer correlations are mostly in the high .80s and low .90s, and test-retest coefficients are typically in the .50 to .90 range, with periods ranging from 1 day to 8 months.

Norms. The 1974 norms presented by Koppitz are said to reflect a socioeconomic cross section and include blacks, orientals, and Hispanics. In general, however, the normative data available to the clinician are quite limited and not as extensive or representative as what is available for tests such as the Stanford-Binet and the Wechsler series.
Validity. The validity of the Bender-Gestalt as a measure of neuropsychological functioning was reviewed by Tolor and Brannigan (1980), who concluded that the test was remarkably effective in discriminating between psychiatric patients, brain-damaged patients, and normal controls, with hit rates typically above 70%. However, the validity of the Bender-Gestalt as a measure of personality and/or emotional disturbance seems to be typical of other projective techniques – difficult to prove psychometrically, with mixed results at best (Whitworth, 1984). Clearly, the Bender-Gestalt has “face” validity as a measure of visual-motor integration, and factor-analytic studies do support its structure as a measure of visual-motor perception (e.g., Becker & Sabatino, 1973). A number of studies suggest that scores on the Bender-Gestalt are related to a wide variety of variables, such as achievement in reading and in arithmetic, general intellectual competence, and various other measures of visual-motor integration (e.g., Aylward & Schmidt, 1986; Breen, Carlson, & Lehman, 1985; D. Wright & DeMers, 1982). In at least one study, it was found that students with high Bender-Gestalt scores tended to do well academically, but a prediction could not be made for students with low scores (Keogh & Smith, 1967).

A classical study. L. R. Goldberg (1959) took 30 Bender-Gestalt protocols from the files of a Veterans Administration (VA) hospital. For 15 of these protocols there was substantial evidence that the patients had organic brain damage; for 15 of the patients there was no such evidence. He gave the protocols to three groups of judges – 4 clinical psychologists, 10 psychology trainees, and 8 nonpsychologists (e.g., secretaries) – and asked each to indicate whether the protocol was “organic” or not and how confident they were in their judgment. Goldberg points out that for the psychologists and the trainees this is precisely the task they encounter in their professional work, and all of them were familiar with the Bender-Gestalt, an instrument designed to assess this diagnostic question. How correct were the psychologists? 65%. Not bad. How correct were the psychology trainees? 70%. And the nonpsychologists? 67%! Interestingly, the nonpsychologists were more confident in their judgment than the psychologists or the trainees, and the trainees were more confident than the psychologists. In addition, there was no relationship between a person’s diagnostic accuracy and his or her degree of confidence. Goldberg also asked a Bender-Gestalt “expert” to diagnose the 30 protocols – the expert’s diagnostic accuracy was 83%. L. R. Goldberg makes two points: (1) by rating all protocols as nonorganic, one could have obtained an accuracy of 80%; (2) by scoring the protocols according to the Pascal-Suttell method, one could also have obtained an 80% accuracy rate. These results do not necessarily argue against the usefulness of the Bender-Gestalt, but they do support the notions that psychometric analysis is better than impressionistic formulations and that a test’s usefulness is limited by the capabilities of the test user.

A recent study. Bornstein (1999) carried out a meta-analysis of studies that compared objective/self-report measures of interpersonal dependency with projective measures of the same variable, and found that projective tests such as the Rorschach and the TAT correlated higher with external indices of dependency-related behavior than did the objective tests (mean correlations of .37 vs. .31). More studies like this one are needed before we can objectively come to a conclusion about the value of specific projective techniques for specific domains of inquiry.

SOME CLINICAL ISSUES AND SYNDROMES

Clinical vs. Statistical Prediction

Given the same information, which might include or exclusively be test scores, how can that information be combined so as to maximize the correctness of our prediction? For example, Mary is an applicant to graduate school. As an undergraduate, her GPA was 3.06, her GRE Verbal score was 650, and her letters of recommendation contain 46 glowing adjectives. We can put this information together using our clinical judgment and intuition, noting that although her GPA is somewhat low for graduate work, her GRE score is quite respectable and her letters of recommendation quite positive. We conclude that she should be accepted, with the prediction that she will do relatively well. Or we can place her scores, as well as those of other candidates, into a regression equation and in fact compute her predicted GPA in graduate school. We can then accept or reject her on the basis of the regression-equation results. These two methods are known as the clinical and statistical prediction methods.

Meehl (1954) wondered about the relative accuracy of clinical judgment when compared to statistical prediction, particularly because studies of psychiatric diagnoses showed that such diagnoses were often somewhat unreliable; that is, there was less than unanimous agreement among psychiatrists. In diagnosing general conditions such as psychosis, agreement ranged from the mid 60% to the mid 80% range, but for specific conditions agreement ranged from about the mid 30% to the mid 60% range. Meehl (1954) found 19 studies relevant to this issue. Nine of the studies found statistical prediction to be more accurate; 10 of the studies found no difference between clinical and statistical methods. No studies found the clinical method to be superior. A later study reviewed 45 such studies and essentially supported Meehl’s findings (Sawyer, 1966). An analysis of 136 studies (Grove, Zald, Lebow, et al., 2000) indicated that on average “mechanical prediction techniques” (e.g., using regression equations) were about 10% more accurate than clinical (i.e., subjective) techniques. Mechanical prediction outperformed clinical prediction in 33% to 47% of the studies examined, while in only 6% to 16% were clinical predictions more accurate than mechanical predictions.
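The mechanics of the statistical method are easy to illustrate. Below is a minimal sketch in Python, assuming a linear regression equation estimated from the records of past applicants; all of the data, and therefore the fitted weights, are hypothetical and purely illustrative.

```python
# A minimal sketch of statistical (mechanical) prediction: estimate a
# regression equation from past applicants, then apply it to a new case.
# All numbers are hypothetical, for illustration only.
import numpy as np

# Past applicants: undergraduate GPA, GRE Verbal, number of positive
# adjectives in the letters -- and the graduate GPA each later earned.
X = np.array([
    [3.50, 620, 30],
    [3.10, 580, 22],
    [3.80, 700, 41],
    [2.90, 540, 18],
    [3.40, 660, 35],
])
y = np.array([3.6, 3.1, 3.9, 2.8, 3.5])  # observed graduate GPAs

# Least-squares estimate of the regression weights, with an intercept term.
X1 = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Mary's credentials from the example above: GPA 3.06, GRE V 650,
# 46 glowing adjectives.
mary = np.array([1.0, 3.06, 650, 46])
print(f"Predicted graduate GPA: {mary @ w:.2f}")
```

Once the weights are estimated, every applicant is evaluated by exactly the same equation; no judgment enters at the prediction stage, which is what the studies reviewed by Meehl and by Grove et al. mean by “mechanical.”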
The Effective Therapist

One of the crucial issues in clinical psychology, as well as in other fields where direct service is provided to clients, is the degree to which the effectiveness of the treatment is related to the effectiveness of the practitioner. Thus, it becomes important to identify effective therapists. Whitehorn and Betz (1960) did exactly that by comparing SVIB (see Chapter 6) responses of therapists who were successful with schizophrenic patients and therapists whose improvement rates with their patients were low. The result was a 23-item scale called the AB scale, in this case type A referring to successful psychotherapists and type B to less successful ones. Type A therapists scored high on the Lawyer and CPA scales of the SVIB, and low on the Printer and Mathematics-Physical Science Teacher scales, whereas type Bs showed the opposite pattern. Type As were thus seen as individuals characterized by a problem-solving approach, one that included genuineness and respect, an ability to understand the patient’s experiences, and an expectation of responsible self-determination.

Eventually, the hypothesis was suggested, based on some empirical evidence, of an interaction between therapist type and client diagnosis, such that type A therapists were more successful with schizophrenic patients, and type B therapists were more successful with neurotic patients. Whether this is the case is debatable (Chartier, 1971; Razin, 1971). Unfortunately, research in this area has been complicated by the fact that there are nine different versions of this scale, and it is not always clear which version is being used, although at least four of the scales are so highly intercorrelated that they can be considered alternate forms (Kemp & Stephens, 1971).

Alcoholism

The assessment of patients with alcoholism serves at least three purposes: (1) the test information can result in individualized treatment that takes into account the patient’s needs, coping skills, psychodynamic strengths, risk factors, and so on; (2) the test information can allow a better match between client and available treatment options; and (3) the test results can serve to monitor the course of therapeutic progress or lack of it (Allen & Mattson, 1993).

Allen and Mattson suggest that assessment instruments for alcoholism can be subsumed under nine categories:

1. Screening tests to determine, for example, whether an in-depth evaluation is required.
2. Diagnostic tests to determine the diagnostic status of, for example, alcohol-related conditions, such as drug abuse and brain deterioration.
3. Triage tests (triage being a medical term referring to the assignment of patients to specific treatments), whose results might be used to determine the appropriate setting, such as hospitalization vs. outpatient treatment, and the intensity of the treatment.
4. Treatment planning, where the test results are used to establish treatment goals and strategies appropriate to the patient.
5. Outcome monitoring, for example, to assess at the end of a specified time period whether the patient requires further treatment.
6. Program evaluation measures, to assess the therapeutic program itself rather than the individual client.
7. Scales for family and marital functioning.
8. General psychological tests, such as the MMPI, to assess general emotional adjustment, neuropsychological functioning, degree of psychopathology present, and so on.
9. In-process measures, used to assess specific aspects of the treatment program and the client’s progress in that program – for example, the degree of acceptance that one has a drinking problem.

Quite clearly, specific tests can be used in several of these categories. A number of such instruments (in addition to old standards such as the MMPI) are available to assess alcoholism and/or drug abuse, such as the Alcohol Expectancy Questionnaire (S. A. Brown, Goldman, Inn, et al., 1980), the Alcohol Use Inventory (Horn, Wanberg, & Foster, 1987), and the Personal Experience Inventory (Winters & Henley, 1989) (see Lettieri, Nelson, & Sayer, 1985, for a listing of 45 such instruments).

Probably the most popular type of scale is the self-report concurrent measure, i.e., the use of inventories that assume that alcoholics differ from nonalcoholics and therefore respond in consistently different ways to selected self-report items. There are two major test-construction strategies used here (W. R. Miller, 1976). The first, or indirect scale strategy, involves the administration of a large pool of items, such as the MMPI, items which have little or no obvious relationship to drinking behavior. The responses of alcoholic and control samples are then statistically analyzed, and a scale is compiled of items whose responses are statistically different in the two groups (this is, of course, the empirical approach discussed in Chapter 4). The second strategy is the direct scale strategy, where the pool of items is directly relevant to drinking and related behaviors, i.e., the items have face validity.
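A minimal sketch of the indirect, empirically keyed approach follows, assuming we have endorsement counts for each item in the two samples; the chi-square test and the .01 cutoff are one reasonable choice of analysis, not necessarily the specific procedure Miller describes.

```python
# A minimal sketch of empirical keying: keep the items whose endorsement
# rates differ reliably between alcoholic and control samples.
from scipy.stats import chi2_contingency

def empirically_keyed_scale(endorsed_alc, endorsed_ctl, n_alc, n_ctl, alpha=0.01):
    """endorsed_*: dicts mapping item id -> count of respondents endorsing it."""
    keyed = []
    for item in endorsed_alc:
        a, c = endorsed_alc[item], endorsed_ctl[item]
        # 2 x 2 table: endorsed vs. not endorsed, by group
        _, p, _, _ = chi2_contingency([[a, n_alc - a], [c, n_ctl - c]])
        if p < alpha:
            keyed.append(item)  # item discriminates; key it for the scale
    return keyed

# Hypothetical counts for two items in samples of 100 each:
alc = {"item 1": 80, "item 2": 52}
ctl = {"item 1": 30, "item 2": 50}
print(empirically_keyed_scale(alc, ctl, 100, 100))  # -> ['item 1']
```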
Table 15–1. Dimensions on the EDI

Subscale                     Definition
1. Drive for thinness        The wish to lose weight and the fear of gaining weight
2. Bulimia                   The tendency to binge and to purge
3. Body dissatisfaction      Believing specific body parts to be too large
4. Ineffectiveness           Feelings of general inadequacy
5. Perfectionism             Excessive personal expectations
6. Interpersonal distrust    A sense of alienation
7. Interoceptive awareness   Lack of confidence in one's ability to recognize and identify hunger and satiation
8. Maturity fears            Fear of the demands of adulthood
A good example of the first approach is the MacAndrew (1965) alcoholism scale derived from the MMPI. This is a 49-item scale, with the keyed items indicating that alcoholics report themselves to be outgoing and social, to have few problems with self-image, to have had school problems, and to experience physical problems due to excessive alcohol intake. The two statistically most discriminating items on this scale were face-valid items (e.g., “I have used alcohol excessively”) and were thus eliminated from the scale. An example of the second approach is the Michigan Alcoholism Screening Test (MAST; Selzer, 1971), which contains 25 items with a high degree of face validity (e.g., “Do you have troubles with drinking?”). The validity of both of these scales remains an open issue.

A second type of scale is based on the notion that there is an alcoholic “personality,” and thus standard scales of personality, from the MMPI to measures of depression, have been used. Such studies have typically found significant differences between alcoholics and controls, to the point that Keller (1972) concluded that the study of any trait in alcoholics “will show that they have either more or less” of that trait! Mention should also be made of physiological measures, such as dependence on alcohol, metabolic rates, or effects on memory and sleep, which can be used diagnostically.

Allen and Litten (1993) reviewed a broad range of psychological as well as laboratory tests (such as blood tests that assess biochemical markers of alcoholism) used to identify and treat alcoholics. These authors point out that such measures have several advantages in addition to their primary use as screening instruments; these include enhancing the patient’s motivation to change his or her behavior and reinforcing patient progress. They also point out some limitations: such instruments do not measure motivation or beliefs about the benefits of recovery, or various other variables that may be related to success in treatment. In addition, most alcoholism measures have been developed with adult patients rather than with adolescents. Most normative groups do include women and minority patients, but separate norms for these groups are rarely available.

Eating Disorders

As the name implies, eating disorders involve abnormal eating behaviors. Two major categories are identified: anorexia, which involves extreme restriction of food intake, and bulimia, which involves bouts of binge eating followed by vomiting and/or laxative use. Two subtypes of anorexia are currently recognized: bulimic anorexics, who engage in binge eating, and restrictor anorexics, who severely restrict their food intake. In the assessment and research of eating disorders, a wide variety of scales are used. Some have to do with body dissatisfaction, others are survey checklists of weight-reduction methods, and still others are self-report questionnaires of eating attitudes and behaviors.

The Eating Disorder Inventory-2 (EDI-2). The EDI-2 (Garner, 1991) is a 91-item self-report measure of the symptoms, behaviors, and feelings related to bulimia and anorexia. The original EDI (Garner & Olmsted, 1984) consisted of 64 items scored on 8 subscales and was developed with the recognition that eating disorders are multidimensional; the 8 scales are listed in Table 15.1. The initial pool consisted of 146 items generated by clinicians familiar with eating disorders.
Items were retained that discriminated between anorexic patient and control samples, that correlated with the relevant subscale more than with other subscales, and that had alpha coefficients above .80. The EDI-2 contains an additional 27 items scored on 3 provisional subscales, as well as expanded norms totaling almost 1,000 patients. The EDI-2 items cover both specific eating behaviors, such as dieting, and more adjustment-oriented aspects, such as meeting parental expectations. Responses are given on a 6-point scale ranging from “always,” through “often” and “sometimes,” to “never,” although the actual scores range from 0 to 3, with responses of “never,” “rarely,” or “sometimes” all given a zero weight. The scores are recorded on a profile form that allows direct comparison with patient and female college-student norms.
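A minimal sketch of this scoring rule, assuming an item worded in the symptomatic direction and assuming “usually” as the sixth response option (the description above names only five of the six labels):

```python
# A minimal sketch of EDI-2 item scoring: the most symptomatic response
# earns 3, the next two earn 2 and 1, and the three least symptomatic
# responses ("sometimes," "rarely," "never") all earn 0.
SCORE = {"always": 3, "usually": 2, "often": 1,
         "sometimes": 0, "rarely": 0, "never": 0}

def subscale_score(responses):
    """responses: the response labels for the items of one subscale."""
    return sum(SCORE[r] for r in responses)

print(subscale_score(["always", "often", "never", "usually"]))  # -> 6
```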
The primary purpose of the EDI-2 is to aid clinicians in assessing patient symptoms, planning treatment, and evaluating the effectiveness of therapeutic interventions. It can also be used as a screening inventory to identify at-risk individuals, to assess the incidence of symptoms in a target sample, or to study, from a research point of view, what specific changes may be related to various therapeutic modalities.

The EDI-2 is relatively easy to administer and score. It is appropriate for both males and females, primarily for adolescents and adults, but can be used with children as young as 12. The EDI-2 takes about 20 minutes and can be administered individually or to groups. The answer sheet contains a carbon page that allows a direct translation of the chosen responses into scores on the appropriate scales. Scoring is thus simply a clerical task. The obtained raw scores can be changed to percentile ranks using patient or nonpatient norms.

For the eight original EDI scales, internal consistency reliability is quite adequate, ranging from the low .80s to the low .90s for patient samples both in the United States and in other countries (e.g., Norring & Sohlberg, 1988). For the three provisional scales, the reliabilities are significantly lower, with internal-consistency alpha coefficients falling below .70 for two of the three scales. Test-retest reliability is also adequate, with coefficients in the mid .80s for short periods (1 to 3 weeks), and substantially lower for longer time periods.

Much of the validity of the EDI-2 is criterion-related validity, with several studies showing eating-disorder patient samples scoring significantly higher than control samples. A number of studies have also shown that specific subscales show significant changes in patient samples as a function of specific treatments. Concurrent validity data are also available showing that scores on the eight original subscales correlate significantly with scores on other eating-disorder inventories, with clinical ratings on the same dimensions, and with body-image measures. A representative study is that by Gross, Rosen, Leitenburg, et al. (1986), who administered the EDI and another eating attitudes test to 82 women diagnosed as bulimic. Both instruments discriminated bulimic patients from control subjects, despite the fact that many subscales on the two tests did not correlate significantly with each other; for example, the EDI subscale of “bulimia” did not correlate significantly with the other test’s subscale of “dieting.” Many of the subscales also did not correlate significantly with behavioral measures such as the amount of calories consumed during three standardized meals. The authors concluded that both self-report measures and direct behavioral measures of eating and vomiting would be needed, and that their results supported the criterion validity of the tests but showed only partial support for the concurrent validity.

Some construct-validity data are also available; for example, several studies have obtained eight factors on the EDI that correspond to the eight original scales, but some have not (e.g., Welch, Hall, & Norring, 1990). Investigators in New Zealand administered the EDI to three samples of nonpatient women: 192 first-year psychology students, 253 student nurses, and 142 aerobic dance-class enrollees. A factor analysis did not confirm the original eight scales, but indicated three factors in each of the samples: a factor focusing on concern with body shape, weight, and eating; a factor of self-esteem; and a factor of perfectionism (Welch, Hall, & Walkey, 1988). Not all EDI scales show an equal degree of validity, and further research is needed to indicate which scales are indeed valid and useful.

HEALTH PSYCHOLOGY

Psychology and medicine. Originally, the interface between psychology and medicine occurred in the area of mental health, but recently there has been great concern with the behavioral factors that affect physical health and illness.
Thus, in the 1970s, the field of health psychology developed in a very vigorous way into a subarea of psychology that is in many ways distinct from clinical psychology (Stone, Cohen, & Adler, 1979). Health psychology includes a variety of aspects, ranging from the study of the processes by which behavior is linked to physical disease, to the study of the physiological effects of stressors that can influence susceptibility to disease.

Primary focus. Health psychology is primarily concerned with the psychological aspects related to the maintenance and promotion of good health and the prevention and treatment of illness, as well as with issues such as health-care systems and health-policy formulation (Matarazzo, 1980). Among the major questions asked by health psychologists are how to keep people healthy (especially in terms of diseases such as lung cancer and alcohol and drug abuse), the role of personality factors and coping styles in the development of illness, the role of stress, and the benefits of social supports (see Stokols, 1992, and S. E. Taylor, 1990, for comprehensive overviews).

The healthy person. In the 1940s, Abraham Maslow began to study psychologically healthy individuals, as he believed that one could not understand mental illness unless one first understood mental health. Eventually, he was able to describe the “fully functioning” person and to label this process self-actualization (Maslow, 1970). Shostrom (1963; 1964) developed a personality inventory called the Personal Orientation Inventory, which was based on Maslow’s ideas and operationalized self-actualization (Knapp, 1976; R. R. Knapp, Shostrom, & L. Knapp, 1978).

A somewhat different but related approach is represented by the work of Suzanne Kobasa (1979), who developed the Hardiness Test. People become ill for a wide variety of reasons, ranging from genetic defects or predispositions to environmental happenstance, such as exposure to someone with a contagious illness. Stressful life events do contribute to the development of physical illness, although some people seem to be less vulnerable to such negative effects. One explanatory approach is to postulate that such resistance is due to “hardiness,” a constellation of personality characteristics (e.g., Hull, Van Treuren, & Virnelli, 1987; Kobasa, Maddi, & Kahn, 1982). Three characteristics seem central to this constellation: a hardy person is committed to his or her work and activities (i.e., has a sense of belief in himself or herself and the community), has a sense of control (i.e., the belief that he or she can influence the course of events), and sees life as a challenge (i.e., sees changes as an opportunity to grow rather than a threat to security). The hypothesis is that hardiness mediates or acts as a buffer against stressful life events, or perhaps alters the way in which stressful life events are perceived (Alfred & Smith, 1989).

Originally, the Hardiness Test was composed of six scales from other tests that were considered reliable and valid. Raw scores for each of the scales were changed to standard scores, and these standard scores were added into a total score, so that each scale was given equal weight. The six scales actually made up three dimensions: control, commitment, and challenge. These three dimensions had internal-consistency and test-retest reliabilities all near .70 (Kobasa, Maddi, & Kahn, 1982). Subsequently, the Hardiness Test was revised and shortened to 50 items on the basis of factor analysis. A study comparing the Hardiness Test with two scales from the Personal Orientation Inventory reported modest correlations between the two instruments, and the authors concluded that there was a relationship between them (J. M. Campbell, Amerikaner, Swank, et al., 1989). Questions as to whether hardiness is a unidimensional or multidimensional construct have not been resolved, and indeed the very concept is a debatable one (e.g., Funk & Houston, 1987).
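A minimal sketch, with hypothetical data, of the equal-weight composite just described: converting each scale to standard (z) scores before summing keeps a scale with a large raw-score spread from dominating the total.

```python
# A minimal sketch of an equal-weight composite built from standard scores.
import statistics

def zscores(xs):
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return [(x - m) / s for x in xs]

# Hypothetical raw scores of five respondents on two component scales
# that use very different metrics:
scale_a = [10, 12, 9, 15, 14]
scale_b = [100, 140, 90, 160, 150]

# Each person's composite is the sum of their z scores on the two scales.
totals = [za + zb for za, zb in zip(zscores(scale_a), zscores(scale_b))]
print([round(t, 2) for t in totals])
```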
Physical fitness. Another major area under the topic of health psychology, interfacing with other areas such as sports psychology, is the relationship of physical fitness training to improvements in psychological functioning – in particular, to variables such as body image, cognitive functioning, self-concept, sleep, the reduction of anxiety, and so on. The vast majority of these studies use psychological tests either to categorize the independent variable, such as degree of physical fitness, or to assess the dependent variables, such as increases in self-esteem (see Folkins & Sime, 1981, for a review of some issues in this area).
Self-reports revisited. A psychologist named Woodworth developed the Personal Data Sheet during World War I as a self-report inventory designed to screen out emotionally unstable recruits. This was the prototype of all subsequent self-report instruments, and although current ones are much more psychometrically sophisticated, they still reflect much of Woodworth’s approach.

As I have noted before, self-reports are economical in that they do not require a well-trained professional for administration and scoring, and they are amenable to psychometric treatment, such as scoring and interpretation by computer. They are typically brief, usually inexpensive, and reflect the client’s experience through the client’s own assessment rather than through an observer. This last point can of course be detrimental, in that some professionals are skeptical about the accuracy of self-reports. For an overall review of self-report measures of stress, see Derogatis (1987).

Stress. One area of health psychology concerns the relationship between stress and subsequent susceptibility to disease. All individuals experience events or changes in their lives that may be potential stressors because they require coping, readjustment, and adaptation. The energy used to cope with such stressors is assumed to rob the body of resistance to disease, and thereby to increase the probability of physical illness and interpersonal dysfunction. The field of stress is rather heterogeneous, but for our purposes we can classify theories of stress into three categories (Lazarus, 1966):

1. Stimulus-oriented theories view stress as residing in selected aspects of the environment. Tests that reflect this point of view attempt to catalog and assess such aspects – for example, the “life events” that a person experiences, such as school exams, automobile accidents, divorce, and so on. Perhaps the best known measure illustrating this approach is the Holmes and Rahe Schedule of Recent Experience, discussed next. Other life-event measures include the Life Experience Survey (Sarason, Johnson, & Siegel, 1979) and the Impact of Life Event Scale (Horowitz, Wilner, & Alvarez, 1979).

2. Response-oriented theories define or focus on stress as the response of the individual. Tests and inventories here focus on the affect or mood of the person, their coping patterns, and so on. Almost all the measures of psychopathology that we discussed in Chapter 7 could be listed here as examples of response-oriented measures – such as the MMPI, the Beck Depression Inventory, and the Spielberger State-Trait Anxiety Inventory.

3. A third group of theories might be labeled “interactionist.” They see the person as the major mediating link between the characteristics of the environment and the responses made. The approach here is that not only does the environment have an impact on the individual, but the individual can have an impact on the environment. Because this is a rather complex and dynamic point of view, psychometric assessment has so far not reflected this approach in any significant way, with some exceptions. One such exception is the Jenkins Activity Survey to measure “Type A” behavior, discussed later in this chapter.

Health Belief Model. The theory that has probably generated the most instruments to measure attitudinal components of health behaviors is the Health Belief Model, developed in the 1950s and 1960s (e.g., Janz & Becker, 1984; Rosenstock, 1974). The model developed out of social-psychological concerns as to why people were not using newly developed procedures for the detection of cancer, rheumatic fever, and other illnesses. The model postulated that the likelihood of a person taking action to avoid disease was a function of subjective beliefs along several dimensions: susceptibility to the disease, severity of the consequences of the disease, benefits of taking the recommended health action, and the barriers related to that action. Other dimensions, such as motivation, were later added. Unfortunately, many of the scales developed to empirically assess facets of this model were not evaluated for reliability and validity, and they were criticized for their psychometric limitations (e.g., Champion, 1984).

Life-events research. The notion that life events, such as marriage or an illness, can have an impact on the individual is not a new idea; the focus on measurement goes back to the 1950s and the work of Holmes, Rahe, and their colleagues (Hawkins, Davies, & Holmes, 1957; Holmes & Rahe, 1967; Rahe, Meyer, Smith, et al., 1964).
These investigators proposed that the readjustment required by major life changes, such as a divorce or a change in jobs, substantially increased the risk of physical illness (Holmes & Rahe, 1967). They developed the Schedule of Recent Experience (SRE), a 43-item questionnaire designed to measure the incidence or occurrence of life events which, as a total, were significantly associated with the onset of physical illness. In fact, these items were generated from a larger list of life events that had been clinically observed to cluster at the time of disease onset; these events were taken from the charts of more than 5,000 medical patients. This instrument underwent a number of changes (e.g., B. S. Dohrenwend & B. P. Dohrenwend, 1978; Ross & Mirowsky, 1979), including name changes (e.g., the Social Readjustment Rating Scale); it has been central to a research program on the effects of life changes on subsequent physical, medical, and psychological conditions, ranging from cardiovascular disease and death (e.g., Adler, MacRitchie, & Engel, 1971; Rahe & Lind, 1971), the onset of diabetes (S. P. Stein & Charles, 1971), and complications with birth and pregnancy (e.g., Gorsuch & Key, 1974), to more general susceptibility to illness (e.g., Marx, Garrity, & Bowers, 1975). However, the evidence suggests that the relationship between major life changes and health outcomes is modest at best, with the average correlation coefficient about .12 (Rabkin & Struening, 1976).

The list of items was originally given to 394 subjects, who were asked to rate each item as to how much social readjustment was required for each event. As an anchor, they were told that marriage was given an arbitrary value of 500. These values or ratings were termed “LCUs,” or life change units. The mean score for each life event was then calculated and divided by 10 in order to have the items fall on a scale from a theoretical 0 to 100; thus marriage has a value of 50. The highest-rated items were “death of a spouse” at 100 and “divorce” at 73. Average items were “son or daughter leaving home” and “trouble with in-laws,” both rated at 29. The lowest items were “Christmas” at 12 and “minor violations of the law” at 11. Ratings of the various items were fairly consistent among different subgroups, such as men vs. women, or older vs. younger subjects.

The fundamental assumption of this approach is that major life-change events result in stress. Experiencing such changes requires adjustments that inhibit the body’s natural resistance to illness, with the end result being physical and/or psychological symptoms. Part of the original theoretical framework was that life events, whether positive or negative, were stressful. Subsequent research has shown that the impact is a function of the degree of aversiveness of the event. Two events, such as marriage and divorce, may reflect equal amounts of disruption, but the stressfulness is related to the negative event and not the positive one. Indeed, an important component is how the individual perceives the event (e.g., Horowitz, Wilner, & Alvarez, 1979; Zeiss, 1980).

Another issue is the weighting of the individual items. Originally, differential weights were derived through a technique called “direct magnitude estimation,” in which each item was assigned a mean “life change” score. Several investigators have shown that alternative scoring schemes, using unitary weights (i.e., every item endorsed is counted as one) or factor weights (i.e., items are scored to reflect their loading on a factor), are just as predictive, if not more so (e.g., Grant, Sweetwood, Gerst, et al., 1978; Lei & Skinner, 1980). In most studies where different scoring techniques are compared, weighted scores and unitary-weight scores typically correlate in the low to mid .90s (e.g., G. R. Elliot & Eisdorfer, 1981; M. Zimmerman, 1983). Might the weights of the individual items change across time? Scully, Tosi, and Banning (2000) found that 14 of 43 events did show a significant shift in weight, but there was a correlation of .80 between the original and new weights.
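A minimal sketch of the two scoring schemes under discussion, using the published LCU weights quoted above and a hypothetical set of endorsed events:

```python
# Weighted (LCU) scoring vs. unitary-weight scoring of the same checklist.
LCU = {"death of a spouse": 100, "divorce": 73,
       "son or daughter leaving home": 29, "trouble with in-laws": 29,
       "Christmas": 12, "minor violations of the law": 11}

endorsed = ["divorce", "trouble with in-laws", "Christmas"]  # hypothetical

weighted_score = sum(LCU[event] for event in endorsed)  # 73 + 29 + 12 = 114
unit_score = len(endorsed)                              # each event counts 1

print(weighted_score, unit_score)  # -> 114 3
```

As the studies cited above suggest, across a sample of respondents the two scores tend to rank people almost identically, which is why the simpler unit count loses little, if any, predictive power.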
A third issue concerns the dimensionality of these life-event questionnaires. Originally, they were presented as unidimensional, with endorsed events contributing to an overall stress score. Subsequent studies have presented evidence of the multidimensionality of such scales (e.g., Skinner & Lei, 1980).

How stable are life-change scores over time? In a 2-year study, the rank ordering of the amount of readjustment required by life events remained quite consistent for both male psychiatric patients and normal controls, with correlation coefficients ranging from .70 to .96. There was also stability in the absolute weights given, but only for the normal controls. These results suggest that for normal individuals the perception of the impact of life changes is relatively stable, but the same cannot be said for psychiatric patients (Gerst, Grant, Yager, et al., 1978).

What about cross-cultural results? Holmes and Rahe (1967) suggested that the items in their scale were of a universal nature and should therefore be perceived similarly in different cultures. There is some support for this, with studies of Japanese (Masuda & Holmes, 1967), Mexican-American (Komaroff, Masuda, & Holmes, 1968), and Swedish (Rahe, Lundberg, Bennett, et al., 1971) samples yielding fairly similar results.

P. Baird (1983) criticized the Holmes and Rahe scale as including too few categories of life events (i.e., lacking content validity), being biased against unmarried persons (more than 15% of the items pertain to spouse or marital circumstances), having many vague items, and having some items that could reflect the consequence of illness rather than the cause (for example, severe illness could result in difficulties on the job, and not necessarily the other way around).

The publication of the Holmes and Rahe scales was followed by a huge amount of research, as well as substantial debate about the adequacy of this and alternative approaches (e.g., Moos & Swindle, 1990; Raphael, Cloitre, & Dohrenwend, 1991; Sandler & Guenther, 1985). As criticisms were made, newer versions of such checklists were developed; M. Zimmerman (1983), for example, cited 18 such life-event inventories.

R. J. Turner and Wheaton (1995) identified nine key issues regarding checklists to measure stressful life events:

1. How does one define and select the life events to be included in a checklist? This includes the issue of content validity, the notion that different subsets of items may not be germane to specific samples – for example, “death of a spouse” may be quite relevant to the elderly but of relatively rare occurrence among college students.

2. Is change per se, whether positive or negative in nature, what matters? Or is it the undesirability of an event? For example, many scales include items such as “vacation” or “promotion,” which for most people would be quite positive in nature. The evidence does suggest a pattern of positive relationships between eventual illness and the occurrence of negative events, whereas the pattern is weak and contradictory in relation to positive events (Zautra & Reich, 1983).

3. A number of items that appear on checklists are themselves indicators or symptoms of illness. There is thus a confounding of events with outcomes that might then result in spuriously high correlations; basically, this is a variation of the criterion contamination discussed in Chapter 3. Specifically, Hudgens (1974) argued that 39 of the 43 life events could be viewed as symptoms or consequences of illness rather than as precipitating events.

4. Internal reliability also presents a challenge here. The experience of one event, for example a vacation, is not intrinsically or theoretically related to the experience of another event, for example getting a divorce. Because the test items are independent, we would not expect a high Cronbach’s alpha. On the other hand, it can be argued that subsets of events are linked together as a function of a person’s “ecological niche.” Thus, an uneducated individual living in the inner city is more likely to experience unemployment, divorce, poor living conditions, and so on.

5. A related problem is that the potential occurrence of stressful events is related to one’s “role occupancy.” As R. J. Turner and Wheaton (1995) point out, the person who is unemployed is not at risk of being fired or of having conflicts with the boss.

6. The original instructions on the Holmes and Rahe scale asked the respondent to consider what they had experienced during the “past year.” In some ways this was an arbitrary time period, but in part it was assumed that the effects of increased stress would show up in about a year’s time. In fact, the assumption appears questionable, and the empirical evidence suggests higher correlations between events and physical illness when the time frame is longer than 1 year.

7. R. J. Turner and Wheaton (1995) note that events are salient for their stress-evoking potential for varying periods of time. Some events are very discrete and others are much more enduring. For example, a vacation is typically brief, whereas financial difficulties can often be more chronic. Some events are discrete, such as the death of a spouse, but their effect is much more chronic. Thus, an effective checklist should assess not just the occurrence of an event but also its time duration.

8. There is the issue of weights. The two most common approaches to weighting items are to use the average LCU weight assigned by a sample of raters, or to use the individual’s own subjective ratings of such LCUs.
Neither approach, however, produces a more significant result than a simple count of the number of events endorsed. From a psychometric point of view this makes sense – all items are equal. But from a theoretical or logical point of view, the argument that the death of a spouse and a vacation produce equal amounts of stress is somewhat awkward.

9. Finally, there is the issue of reliability. Most of the concerns discussed in Chapter 3, such as the effect of memory on short test-retest periods, apply to these checklists. You recall that the usual procedure for test-retest reliability is to compare total scores obtained at two points in time. Total scores that are equal may, however, reflect different constellations of endorsed items, and so R. J. Turner and Wheaton (1995) argue that reliability should be considered at the individual item level rather than at the level of the total score. Yet recall that reliability reflects test length, and the reliability of one item can be quite low (see Kessler & Wethington, 1991, for a different approach to assessing reliability). Despite all these criticisms, the Social Readjustment Rating Scale continues to be one of the most widely cited instruments in the stress literature and is judged to be a useful tool (Scully, Tosi, & Banning, 2000).
Hassles and uplifts. Another approach is represented by the work of Lazarus and his colleagues (e.g., Lazarus, 1980), who focus on the relatively minor stresses and pleasures of everyday life – what they call hassles and uplifts (Kanner, Coyne, Schaefer, et al., 1981). These investigators developed a hassles scale and an uplifts scale. You recall from Chapter 8 that the hassles scale consists of a list of 117 hassles, such as “troublesome neighbors,” “too many responsibilities,” and “the weather.” These were simply generated by the research staff using predetermined categories such as work, health, and family, but with no indication of how the final items were selected. Respondents are asked to rate each item first on whether the hassle occurred during the past month and second, using a 3-point scale, on the severity of the hassle (somewhat, moderately, or extremely severe).
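A minimal sketch of how such a checklist yields its two scores, assuming (as is common in this literature) that frequency is the count of endorsed hassles and intensity is the mean severity rating of those endorsed; the ratings shown are hypothetical (1 = somewhat, 2 = moderately, 3 = extremely severe).

```python
# Frequency and intensity scores from one month's hassle ratings.
ratings = {"troublesome neighbors": 2,
           "too many responsibilities": 3,
           "the weather": 1}  # unendorsed hassles are simply absent

frequency = len(ratings)
intensity = sum(ratings.values()) / frequency if frequency else 0.0

print(frequency, intensity)  # -> 3 2.0
```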
The uplifts scale consists of a list of 135 uplifts, ranging from “daydreaming” to “looking forward to retirement.” Items on this list are also circled if they occurred in the past month, and they are also rated on intensity (i.e., how often: somewhat, moderately, or extremely often). Here also, no indication is given as to what psychometric considerations, if any, entered into the construction of the scale.

The two scales were administered once a month for 9 months to a sample of 100 adults. For the hassles scale, the average month-to-month reliability coefficient was .79 for the frequency score and .48 for intensity; for the uplifts scale the coefficients were .72 and .60, respectively. Thus, these subjects experienced roughly the same number of hassles and uplifts from month to month, but the amount of distress or pleasure varied considerably. From a psychometric point of view, these coefficients might be considered test-retest coefficients, and at least for intensity they fall short of what is required. Hassles and uplifts scores are also related to each other – with a mean r of .51 for frequency and .28 for intensity – perhaps reflecting a response style (see Chapter 16) or a tendency for people who indicate they have many hassles also to indicate they have many uplifts. These investigators and others (e.g., Weinberger, Hiner, & Tierney, 1987) have reported that hassles are better predictors of health status than major life-change events.

Health Status

Since the 1970s, a number of measures have been developed to assess the physical and psychological health of populations. Although the concept of “health” is a fairly simple one, it is complex to define, particularly because health care has shifted its focus from increasing longevity to the “quality of life” that people have. Most of these instruments are designed to assess groups of individuals, such as persons living in a particular community; thus, they are typically short, simple, and easily administered. Their focus is typically the absence of ill health rather than the presence of good health; that is, the focus is on how ill the person is rather than on how well. Some of the measures are very general: they assess health status quite broadly. Others are very specific, in that they focus on the presence or absence of a particular condition, such as cancer.

From a measurement point of view, there are a number of major problems and challenges to be met. One is that the same terms are often used differently by different investigators, and different terms are used as synonyms. For example, some investigators equate measures of quality of life with measures of health status, and some do not. Another issue is the sensitivity of these scales to clinical change. That is, a measure of health status should not only show the usual reliability and validity, but should also be sensitive enough to detect the differences that may occur as a function of treatment, such as chemotherapy for cancer; in fact, the evidence suggests that many brief measures of health status are sensitive to such changes (e.g., J. N. Katz, Larson, Phillips, et al., 1992). Another issue is length: even a 15-minute questionnaire may be too long when one is assessing the elderly, the infirm, people undergoing surgical treatments, and so on.

These health-status measures generally have four purposes: (1) to examine the health of general populations; (2) to examine the effects of clinical interventions; (3) to examine changes in the health-care delivery system; and (4) to examine the effects of health-promotion activities (Bergner & Rothman, 1987). Among the better known health-status measures are the Activities of Daily Living (Katz, Ford, Moskowitz, et al., 1963), the General Health Survey (A. L. Stewart, Hays, & Ware, 1988), and the Sickness Impact Profile (Bergner, Bobbitt, Carter, et al., 1981; Bergner, Bobbitt, Kressel, et al., 1976); for a review of the major scales, see R. T. Anderson, Aaronson, and Wilkin (1993).

The Short-Form General Health Survey (SF-36). Originally this was developed as a 20-item form and later expanded to 36 items (A. L. Stewart, Hays, & Ware, 1988; Ware & Sherbourne, 1992). The SF-36 has two aims: to represent the multidimensional concept of health, and to measure the full range of health states. The SF-36 was actually one of several forms developed as part of the Medical Outcomes Study, a large-scale study to monitor the results of medical care. The test manual and U.S. norms have recently been published (J. E. Ware, Snow, Kosinski, et al., 1993).

The SF-36 contains eight subscales: physical functioning (10 items); role limitations due to physical problems (4 items); bodily pain (2 items); mental health (5 items); role limitations due to emotional problems (4 items); social functioning (2 items); vitality (4 items); and general health perceptions (5 items).

The validity of the subscales has been largely determined using criterion groups of medical and psychiatric patients, as well as against the long-form version. In one study (McHorney, Ware, & Raczek, 1993), four groups of patients were compared: patients with minor chronic medical conditions, patients with serious chronic medical conditions, patients with psychiatric conditions, and patients with both serious medical and psychiatric conditions. The results indicated that the SF-36 functioned as hypothesized, discriminating between degrees of medical condition (i.e., severely ill, moderately ill, and healthy groups) and between medical and psychiatric conditions.

It is too early to tell about the cross-cultural validity of this scale. There is a project underway designed to study the applicability of the SF-36 in 15 different countries (Aaronson, Acquadro, & Alonso, 1992).

Some studies have reported limitations of this scale. For example, there is a “floor” effect in severely ill samples, where 25% to 50% of the sample obtained the lowest possible score, and scores on some of the subscales do not correlate highly with the criterion when the criterion (e.g., degree of control over diabetes) is measured on something other than a categorical (e.g., yes or no) scale (R. T. Anderson, Aaronson, & Wilkin, 1993). Content validity can also be questioned. For example, the physical activity items cover only gross-motor activities, such as walking and kneeling, and not activities such as cleaning and shopping that may be influenced by the presence of illness.
P1: JZP
0521861810c15 CB1038/Domino 0 521 86181 0 March 4, 2006 14:20

416 Part Four. The Settings

Scores are obtained for each area, three summed indices, as well as an overall score. Lower scores indicate more desirable outcomes, with items weighted by the severity of health impact. The standard form of the SIP is administered by an interviewer and takes about 30 minutes; there is also a self-administered form and a mail version, although the interviewer form is more reliable (Pollard, Bobbitt, Bergner, et al., 1976). One of the more unusual aspects of the SIP is that test administrators need to undertake a self-training procedure, as outlined in the test's training manual.

Handscoring of SIP protocols is awkward at best. Consider, for example, the item, "I sit during much of the day." If endorsed, this item is worth 4.9 points; the score for each area is calculated by adding the scale values for each item endorsed within that category and dividing by the maximum possible "dysfunction" score for that category. This maximum value is provided; the item above, for example, falls in the "Sleep and Rest" category, for which the maximal value is 49.9. The resulting score is then multiplied by 100 to obtain the category score. Fortunately, computer scoring is available.
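The category-scoring arithmetic just described is easy to express in a few lines of code. The sketch below is our own illustration, not part of the SIP materials: the 4.9-point item weight and the 49.9 category maximum come from the "Sleep and Rest" example above, while the second endorsed weight is hypothetical.

    # A minimal sketch of SIP category scoring, assuming the item weights
    # and category maxima are taken from the SIP scoring manual.
    def sip_category_score(endorsed_weights, category_maximum):
        """Add the scale values of the endorsed items, divide by the
        maximum possible 'dysfunction' score, and multiply by 100."""
        return 100 * sum(endorsed_weights) / category_maximum

    # "I sit during much of the day" is worth 4.9 points and falls in the
    # Sleep and Rest category, whose maximum value is 49.9; the second
    # weight (3.0) is hypothetical.
    print(round(sip_category_score([4.9, 3.0], 49.9), 1))  # prints 15.8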
Although there are 12 areas or subscales, the SIP actually measures three dimensions: psychosocial functioning, assessed by the sum of 4 scales; physical functioning, assessed by the sum of 3 scales; and independent aspects, assessed by the sum of 5 scales.

The SIP has been used in many studies, with a wide variety of medical and psychiatric patients, and the results have supported its reliability and validity. For example, scores on the SIP psychosocial component correlate moderately (rs of .40 to .60) with more traditional measures of anxiety and of depression. It appears that the SIP may be more sensitive to declines in health-related quality of life than to improvements. The SIP is available in several languages, including French, German, and Norwegian, and studies in different countries have yielded basically similar findings.

The original test manual has a thorough discussion of how to administer the test, but the discussion of reliability and validity consists simply of enumerating various approaches that have been taken, without giving specific results. The interested user of this test, then, must dig through the literature to obtain any relevant data.

The McGill Pain Questionnaire (McGill PQ)

The McGill PQ, published by Melzack, a psychologist at McGill University, was designed to provide a quantitative profile of clinical pain. Originally, it was intended as a way to evaluate the effectiveness of different pain therapies, but it is often used both clinically and in research applications as a diagnostic tool (Melzack, 1975).

The McGill PQ stems from a theoretical basis that postulates three major psychological dimensions to pain: a sensory-discriminative dimension (for example, how long the pain lasts), a motivational-affective dimension (for example, the fear associated with pain), and a cognitive-evaluative dimension (for example, how intense the pain is).

Melzack realized that pain is a highly personal experience, influenced by individual differences, perceptions, and cultural aspects, and developed a "gate control" theory of pain to explain how the "gating" or modulation of pain is possible (Melzack & Wall, 1965). This theory is at the basis of the McGill PQ, although use of this test does not necessarily require acceptance of the theory.
The first step in the development of the McGill PQ was to establish an item pool, a set of 102 descriptive words related to pain, obtained from the literature, patients' pain descriptions, and other questionnaires. These words were sorted into the three categories, and then further sorted within each category into subgroups. For example, the word "throbbing" was categorized in the temporal category, which is part of the sensory dimension; similarly, the word "shooting" is part of the spatial category and also part of the sensory dimension (Melzack & Torgerson, 1971). Patients and physicians were then asked to rate each item as to intensity, using a 5-point numerical scale from least to worst. In addition to the pain descriptor words, there are other components on the McGill PQ (e.g., questions about the patient's diagnosis, an anatomical drawing on which to indicate the location of pain), but these are not scored. On the actual test protocol, the items are presented in 20 groupings of 2 to 6 items each, listed in order of intensity. For scoring purposes, the 20 groupings are divided into 4 categories: sensory, affective, evaluative, and miscellaneous.

The McGill PQ takes 5 to 15 minutes to complete. Originally, it was intended to be used as an interview so unfamiliar words could be explained, but it is typically self-administered. The instructions indicate that the patient is to describe his or her present pain, and only one word in each grouping is to be circled. In research use and in clinical applications the instructions may vary – the patient may be asked to describe the current pain, the average pain, the most intense pain, and so on.

Scoring. In addition to the three dimensions, there is a fourth "miscellaneous" category. Pain-rating scores can be obtained for each of the four areas, as well as for the total. There are also four indices that can be obtained: (1) Pain Rating Index, based upon the mean scale values of the words endorsed; (2) Pain Rating Index of ranks, based upon the ranks of the words endorsed; (3) Total number of words checked; and (4) Present Pain Intensity, based on a rating scale of 0 (for no pain) to 5 (for excruciating pain). As might be expected, the first two indices correlate substantially (in excess of .90), and thus only the Pain Rating Index of ranks is used because it is simpler to compute. Number of words chosen also correlates quite highly with the two pain ratings (r = .89 and above), while the ratings of present pain intensity correlate less substantially with the other three indices. The initial scoring system was weak, and suggestions have been made to improve it (Kremer, Atkinson, & Ignelzi, 1982; Melzack, 1984).
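To make the relation between a completed protocol and these indices concrete, here is a small illustration of our own; the grouping numbers and rank values are hypothetical, and actual scale values would have to come from the published scoring key.

    # Illustrative computation of three McGill PQ indices for one protocol.
    # 'responses' maps a grouping (1-20) to the rank of the one word circled
    # there, where rank 1 is the lowest-intensity word in that grouping.
    responses = {1: 3, 4: 2, 7: 5, 12: 1, 18: 4}  # hypothetical protocol
    present_pain = 3                              # 0 (no pain) to 5 (excruciating)

    pri_ranks = sum(responses.values())  # Pain Rating Index of ranks
    nwc = len(responses)                 # total Number of Words Checked
    ppi = present_pain                   # Present Pain Intensity

    print(pri_ranks, nwc, ppi)           # prints 15 5 3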
Reliability. Test-retest reliability is difficult to measure because low reliability may in effect mirror the test's sensitivity – pain in its various sources and manifestations does change over time. The evidence in fact seems to suggest that the scale is sensitive to changes in pain over time (e.g., C. Graham, Bond, Gerkovich, et al., 1980).

Factor analysis. Studies of the factor structure of the McGill PQ have typically provided support for a sensory dimension, an affective-evaluative dimension, and a sensory-affective dimension, with different studies finding from two to seven factors (Prieto & Geisinger, 1983). In general there appears to be support for the three dimensions postulated by Melzack, though the dimensions are substantially intercorrelated (Turk, Rudy, & Salovey, 1985).

Validity. There are various lines of evidence that support the validity of the McGill PQ. One such line is that the responses of patients with different pain syndromes, such as menstrual pain vs. toothache, result in different word constellations (Dubuisson & Melzack, 1976). Another line consists of studies that show the sensitivity of the McGill PQ following cognitive or behavioral interventions to reduce pain (e.g., Rybstein-Blinchik, 1979).

The McGill PQ has been translated into numerous languages, and studies done in such countries as Germany, The Netherlands, and Kuwait generally support the cross-cultural usefulness of this measure (Naughton & Wiklund, 1993).

Criticisms. Three disadvantages of the McGill PQ are the time required to administer and score it (in excess of 30 minutes for what seems like a simple questionnaire), the need for trained personnel to administer it because it is confusing to many patients, and the fact that a number of words are unfamiliar to many patients (e.g., lancinating and rasping).

Modifications. Several modifications of the McGill PQ have been proposed, such as a card-sort method (Reading & Newton, 1978) or rating the words through methods other than rating scales (e.g., Gracely, McGrath, & Dubner, 1978; Tursky, 1976). There is also a short form of the McGill PQ available, in which respondents are asked to pick the one word from each category that is most applicable (Melzack, 1987).

The Jenkins Activity Survey (JAS)

The JAS was constructed to measure Type A behavior. Type A behavior, the coronary-prone behavior pattern, is an overt behavioral syndrome or style of living characterized by extreme competitiveness, striving for achievement, aggressiveness, impatience, restlessness,
and feelings of being challenged by responsibility and under the pressure of time. In addition to providing a Type A score, the JAS also provides separate factor scores for three components of Type A behavior: a speed and impatience factor, a job-involvement factor, and a hard-driving and competitive factor.

The JAS is based on the work of M. Friedman and R. Rosenman, two cardiologists who conducted a series of studies on the role of behavior and the central nervous system in the development of coronary disease. They defined the coronary-prone behavior pattern Type A and focused on the excesses of aggression, hurry, and competitiveness, all of which are manifestations of a struggle to overcome environmental barriers. The pattern is neither a personality trait nor a standard reaction to a challenging situation, but rather the reaction of a characterologically predisposed person to a situation that challenges him or her.

Initially, Type A behavior was assessed through a structured interview, which, although apparently valid, required rather rigorous training (Friedman & Rosenman, 1959; C. D. Jenkins, Rosenman, & Friedman, 1967). Other measures were also developed, including a semantic-differential approach, experimental-performance tests, and voice analysis. The JAS was developed in an effort to provide a more standard psychometric procedure accessible to individual practitioners and researchers.

Development. The first experimental form was developed in 1964 and drew heavily from the Rosenman Structured Interview. This form (64 multiple-choice questions) was administered to 120 male employees of a large corporation and the results compared with the Structured Interview. Forty questions statistically discriminated Type A from Type B (i.e., not A) individuals. These 40 questions plus 21 new ones were then published as the first edition of the JAS in 1965. There followed a series of sophisticated studies that essentially computed weights for each item based on that item's ability to discriminate Type As from Type Bs and placed these items into discriminant function equations. The samples used were substantial (a total of almost 3,000) and the procedures were cross-validated more than once. The results indicated that a 19-item discriminant equation best combined efficiency of prediction with brevity.

A second edition of 57 items was published in 1966. Again a series of analyses was undertaken, with a resulting 26-item discriminant equation. At this point, a series of factor analyses was undertaken, and the three factors named above were identified. A third revision came out in 1969, and discriminant equations were developed using each of the factor scores. In 1972, a fourth revision came out, with items revised in language so that JAS items could be appropriate for housewives, medical students, and other groups, rather than simply for the employed, middle-class, middle-aged men of prior studies. In 1979, a fifth edition came out (Form C) that consists of 52 items.

Here are two items to illustrate the JAS:

1. Do you ever have trouble finding time to get your hair cut or styled?
(a) never
(b) occasionally
(c) almost always
(If you select (a), you receive 3 points on the Speed and Impatience factor; (b) = 16 points; and (c) = 40 points.)

2. Is your everyday life filled mostly by
(a) problems needing a solution?
(b) challenges needing to be met?
(c) a rather predictable routine of events?
(d) not enough things to keep me interested or busy?
((a) = 11 points; (b) = 10; (c) = 1; (d) = 12 points for the Type A scale. For Job Involvement, however, (a) = 24 points; (b) = 26; (c) = 2; (d) = 9.)
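The option-specific weighting that these two items illustrate can be sketched as a simple table lookup. The code below is our own illustration: it contains only the weights quoted above, whereas actual scoring uses the full weight tables (and discriminant equations) in the JAS manual.

    # Sketch of JAS-style item weighting, using only the two sample items.
    WEIGHTS = {
        "speed_impatience": {1: {"a": 3, "b": 16, "c": 40}},
        "type_a":           {2: {"a": 11, "b": 10, "c": 1, "d": 12}},
        "job_involvement":  {2: {"a": 24, "b": 26, "c": 2, "d": 9}},
    }

    def raw_scale_score(answers, scale):
        """Sum the weights of the chosen options over the scale's items."""
        table = WEIGHTS[scale]
        return sum(table[item][choice]
                   for item, choice in answers.items() if item in table)

    answers = {1: "c", 2: "b"}  # a hypothetical respondent
    print(raw_scale_score(answers, "speed_impatience"))  # prints 40
    print(raw_scale_score(answers, "type_a"))            # prints 10
    print(raw_scale_score(answers, "job_involvement"))   # prints 26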
The manual clearly indicates that JAS scores should not be used alone to predict individual risk of coronary heart disease. The Type A pattern is one of several important risk factors, but none of these risk factors is sufficiently sensitive to permit individual prediction. In general, however, Type A people have a higher probability of developing coronary heart disease.

The JAS was standardized on the male participants of a longitudinal study on cardiovascular diseases, who held middle-to-upper-level occupations in 10 large California corporations; they ranged in age from 44 to 64 years. The JAS is primarily applicable to male, employed subjects,
although initial evidence suggests it is equally applicable to employed women. A number of questions refer to salaried employment, and thus the JAS may not be applicable to students, retired persons, or self-employed individuals. It is not applicable to adolescents and children.

The JAS requires an eighth-grade reading level and 15 to 20 minutes to complete. Hand scoring is possible, but it is a time-consuming task (about 25 minutes per protocol), subject to clerical errors. Computer scoring is available and preferable. Raw scores are converted to standard scores with a mean of 0 and a SD of 10. The manual presents a table of means and SDs for 35 different samples, most with sizable Ns; there is also a table of percentile equivalents, and a table of correlations with CPI scales.

Reliability. Reliability was assessed by internal-consistency methods. The resulting coefficients range from .73 to .85. Test-retest correlations with intervals of 1 to 4 years fall mostly between .60 and .70. However, many of the coefficients are based on successive modifications of the JAS and thus reflect both change over time and differences between forms.

Validity. There are several lines of evidence that support the construct validity of the JAS. First, there is high agreement between the JAS and a structured interview (e.g., Jenkins & Zyzanski, 1980). Second, several studies have found a significant relation between Type A behavior and coronary heart disease; these studies have typically compared patients with coronary disease with a control group, in a retrospective design (e.g., Jenkins, Rosenman, & Zyzanski, 1974). Third, the manual cites one prospective study, which is the normative study. Analysis of JAS scores of 2,750 healthy men showed the Type A scale to distinguish the 120 future clinical cases of coronary heart disease. In another analysis, three risk factors discriminated between the 220 men surviving a single coronary event and the 67 having recurring events: the Type A score, number of daily cigarettes, and serum cholesterol level. Finally, the manual cites a study in which male patients suffering from a variety of cardiovascular disorders and undergoing coronary angiography completed the JAS on admission to the hospital. Fifty-five men, with two or more main coronary arteries obstructed 50% or more, scored significantly higher on all four JAS scales than the other 36 men with lesser atherosclerosis.

The JAS has been translated into numerous languages and has been the focus of a voluminous body of literature (see reviews by Glass, 1977, and by Goldband, Katkin, & Morell, 1979).

FORENSIC PSYCHOLOGY

In the area of forensic psychology, two of the major applications of psychological testing involve competence to stand trial (i.e., assessing the defendant's ability to understand and participate in legal proceedings) and criminal responsibility or insanity (i.e., the defendant's mental status at the time of the crime). A number of surveys of forensic psychologists and psychiatrists indicate that most practitioners perceive psychological testing to be highly essential to these forensic procedures, and that tests such as the MMPI and the WAIS are used by the vast majority, with other tests such as the Rorschach and the Halstead-Reitan used with lesser frequency (Borum & Grisso, 1995; Grisso, 1986; Lees-Haley, 1992).

Heilbrun (1992) suggested criteria for selecting, using, and interpreting psychological tests in a forensic setting. These included the availability of the test and reviews of its properties (e.g., listed in the Mental Measurements Yearbook); exclusion of a test with a reliability coefficient less than .80; relevancy of the test to the forensic issue (supported with published, empirical data); likelihood of replicating an ideal and standardized testing environment, as close to the conditions under which the test was normed; appropriateness of the test to the individual and situation; consideration of clinical data in lieu of or in conjunction with actuarial data; and assessment of the individual's response style and its impact on testing results.

Competency to stand trial. In a 1960 legal case (Dusky v. United States [1960]), the legal standard for competency to stand trial was established. Such a standard holds that the defendant must have sufficient present ability to consult with his or her lawyer and have a rational as well as factual understanding of the legal proceedings. Mental-health professionals are often asked to
undertake such evaluations, and they often use either traditional instruments such as the MMPI or instruments developed specifically to make a determination of competency. Two such instruments are the Competency Screening Test and the Georgia Court Competency Test (see Nicholson, Briggs, & Robertson, 1988, who include the items for both scales in their article). Both of these measures show excellent interscorer agreement, greater than .90, and substantial agreement with forensic staff decisions regarding competency to stand trial.

The Competency Screening Test consists of 22 sentence-completion stems, such as "If I am found guilty, I . . ." The stems describe hypothetical legal situations for which the respondent must provide appropriate responses. Thus, this is not a projective device, in that there isn't the freedom of responding found in projective sentence-completion tests. Responses are scored from 0 to 2 to reflect competency. The test can be administered orally but is intended as a written test.

The Georgia Court Competency Test (GCCT) consists of 17 questions that cover four areas: (1) understanding of court procedures, (2) knowledge of the charge, (3) knowledge of possible penalties, and (4) ability to communicate rationally with an attorney. A representative item is, "What does the jury do?" Items are scored according to different weights and can add up to 50 points. Total scores are multiplied by 2. Factor analysis initially indicated two principal factors, labeled "legal knowledge" and "defendant's style of responding," but further factor analyses did not replicate these results (see Rogers, Ustad, Sewell, et al., 1996).
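As an illustration of this weighted-credit scoring, consider the sketch below. It is ours, not the published key: only the 50-point maximum and the doubling of the total come from the description above, and the item wordings (other than the representative item) and weights shown are hypothetical.

    # Sketch of GCCT-style scoring: weighted item credit with a raw maximum
    # of 50, doubled to yield a 0-100 total. Item weights are hypothetical.
    ITEM_WEIGHTS = {
        "What does the jury do?": 4,            # the representative item above
        "What is the judge's job?": 4,          # hypothetical item and weight
        "What can your lawyer do for you?": 5,  # hypothetical item and weight
    }

    def gcct_total(credits):
        """'credits' maps an item to the proportion of full credit (0 to 1)."""
        raw = sum(ITEM_WEIGHTS[item] * c for item, c in credits.items())
        return 2 * raw

    print(gcct_total({"What does the jury do?": 1.0,
                      "What is the judge's job?": 0.5,
                      "What can your lawyer do for you?": 1.0}))  # prints 22.0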
The GCCT revised version (the GCCT-Mississippi State Hospital Revision) contains 21 items and is similar in format and administration to its predecessor. The GCCT-MSH has been criticized regarding its limited utility with diverse populations (Mumley, Tillbrook, & Grisso, 2003) and its lack of focus on a defendant's decision-making abilities (Zapf & Viljoen, 2003).

Another competency instrument, the MacArthur Competence Assessment Tool – Criminal Adjudication (MacCAT-CA), requires approximately 30 minutes to administer its 22 items. The MacCAT-CA uses a vignette to assess an individual's reasoning abilities and knowledge of legal proceedings, and questions to assess a defendant's appreciation of his or her specific situation.

Mental illness. A substantial number of individuals who are in jail awaiting trial show evidence of mental illness. Such prevalence rates range from about 5% to 12% for severe mental illness and from 16% to 67% for any mental illness (Teplin, 1991). Hart, Roesch, Corrado, et al. (1993) point out that mentally ill inmates present two major concerns to corrections administrators: (1) jails have a legal responsibility to provide some health care, including mental-health care, to inmates or face civil suits; (2) such inmates typically require different institutional procedures and routines, such as segregation from other inmates and more intense supervision. There is therefore a need for the identification of such individuals through rapid and accurate procedures; unfortunately, the very setting and its limitations result at best in a quick and routine screening.

Teplin and Swartz (1989) developed a Referral Decision Scale (RDS) composed of 18 questions, later collapsed into 15 items. Most of the questions ask whether the respondent has experienced specific symptoms, such as feeling poisoned or loss of appetite. These symptoms, taken from a diagnostic interview, were predictive of three conditions: schizophrenia, bipolar disorder (mania), and major depressive disorder. The RDS was intended for use by correctional officers rather than by psychologists or other mental-health professionals. Scoring is simple: the number of items endorsed in each of the three symptom areas, plus a total score.
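Because the scoring is simply a count of endorsements within each symptom area, it can be sketched in a few lines; the assignment of items to areas shown below is hypothetical and merely illustrates the bookkeeping.

    # Sketch of RDS-style scoring: endorsements counted per symptom area,
    # plus a total. The item-to-area assignment here is hypothetical.
    AREAS = {
        "schizophrenia": ["felt poisoned", "heard voices others could not"],
        "bipolar_mania": ["racing thoughts", "little need for sleep"],
        "depression":    ["loss of appetite", "feelings of hopelessness"],
    }

    def rds_scores(endorsed):
        """Count endorsed items in each area and overall."""
        scores = {area: sum(item in endorsed for item in items)
                  for area, items in AREAS.items()}
        scores["total"] = sum(scores.values())
        return scores

    print(rds_scores({"felt poisoned", "loss of appetite",
                      "feelings of hopelessness"}))
    # prints {'schizophrenia': 1, 'bipolar_mania': 0, 'depression': 2, 'total': 3}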
The RDS was administered to a sample of 790 pretrial defendants in the Vancouver, Canada, area. Of these, 40.7% of defendants were found to exhibit symptoms of mental disorder. The results of this study indicated excellent interjudge reliability and acceptable validity as a screening measure. However, a substantial number of false-positive errors were made, as compared with other assessments.

The polygraph. In the 1920s, the first forerunner of the modern polygraph was constructed – a machine that made continuous recordings of blood pressure, pulse rate, and respiration. The assumption was that these physiological
correlates of emotions could be used as an index of lying.

Some authors consider the polygraph a psychological test, very much like all the other instruments we have discussed, and thus needing to meet reliability and validity standards. Unfortunately, the reliability of the polygraph is usually reported as the total percentage of agreement between two or more raters, or between test and retest results. Although these rates are typically in the 80% to 90% range, these results are open to the same criticism discussed earlier regarding the Rorschach (Kleinmuntz & Szucko, 1984). The validity of polygraph tests is a highly contested issue, perhaps even more acrimonious than the issue of the validity of the Rorschach.

Voir dire. Voir dire is the process of jury selection, particularly the elimination of potential jurors whom the attorneys feel may be biased against the defendant or may not be open-minded about what is to be presented in the courtroom. Studies of jury selection attempt to identify those variables that can reliably predict juror verdicts. This is not only an important practical issue of interest to lawyers, who may wish to maximize the possibility of obtaining a specific verdict, but also an important theoretical issue, because understanding how people arrive at a verdict may be applicable to the understanding of how individuals arrive at solutions in general, and how such aspects as personality interface with cognitive abilities. A wide variety of procedures are used in this process. In terms of psychological testing, there are three major categories: (1) scales that have been developed to measure attitudes and/or personality; (2) scales developed specifically for the voir dire process; and (3) biodata questionnaires (Saks, 1976).

Under category (1), measures of authoritarianism have proven to be popular, with many studies showing that potential jurors high on authoritarianism are more likely to convict or punish the defendant (e.g., McAbee & Cafferty, 1982). Under category (2), a number of instruments have been developed, but many of these also seem to focus on authoritarianism (e.g., Kassin & Wrightsman, 1983). Under category (3), studies that use biodata information generally show significant correlations between biographical variables such as age and judicial outcome, but these correlations tend to be modest at best.

The authoritarian personality is a constellation of characteristics that includes a desire to be part of an orderly, powerful society, with well-defined rules and authoritative leadership, a preference for more conventional norms, hostility toward out-group members, and a belief in the rightness of power and control. It has been hypothesized that individuals high on authoritarianism, if selected to be jurors on a trial, would be likely to convict the defendant if the defendant is perceived to reject legitimate authority. There are a number of personality scales that have been developed to measure authoritarianism, with the California F (fascism) scale being quite popular, especially in the 1950s and 1960s. Other researchers have developed measures of legal authoritarianism, with the item content focusing on beliefs about the legal system. For example, the Legal Attitudes Questionnaire (LAQ; V. R. Boehm, 1968) contains 30 items arranged in triads, in which one item is authoritarian, one is antiauthoritarian, and one is equalitarian. Consider as an example the following triad:

(a) If a person is obviously guilty, they should be given a speedy trial.
(b) Most people who are arrested are innocent.
(c) Simply because a person does not testify on their own behalf should not be taken as evidence of their guilt.

Authoritarian items are those that are essentially punitive in nature or accept the role of legal authority without question. Antiauthoritarian items place the blame for crime on the fabric of society and reject the actions of legal authority, while equalitarian items endorse nonextreme positions or reflect the possibility that more than one answer is possible.

The subject is asked to indicate, for each triad, which item he or she agrees with the most and which the least. These responses are recoded as ranks, with the positively marked item given a rank of 3, the unselected item a rank of 2, and the negatively marked item a rank of 1. The ranks assigned to the 10 authoritarian items are summed (or averaged) to obtain the authoritarian subscale score. The antiauthoritarian and the equalitarian subscale scores are computed in the same manner.
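The rank-recoding just described is mechanical enough to sketch in code. The fragment below is our own illustration: the two abbreviated triads (a full form would have ten) show only how the most/least choices become ranks of 3, 2, and 1 and are then summed by item type.

    # Sketch of LAQ triad scoring: the item marked "agree most" receives a
    # rank of 3, "agree least" a rank of 1, and the unselected item a 2.
    TRIADS = [  # item type of each option; option order varies on the form
        {"a": "authoritarian", "b": "antiauthoritarian", "c": "equalitarian"},
        {"a": "equalitarian", "b": "authoritarian", "c": "antiauthoritarian"},
    ]

    def laq_subscales(responses):
        """'responses' gives (most, least) option letters for each triad."""
        totals = {"authoritarian": 0, "antiauthoritarian": 0, "equalitarian": 0}
        for triad, (most, least) in zip(TRIADS, responses):
            for option, item_type in triad.items():
                totals[item_type] += 3 if option == most else (
                    1 if option == least else 2)
        return totals

    print(laq_subscales([("a", "b"), ("c", "a")]))
    # prints {'authoritarian': 5, 'antiauthoritarian': 4, 'equalitarian': 3}

Note how the scoring itself produces the lack of independence discussed below: ranks within a triad always sum to 6, so a high score on one subscale necessarily depresses the others.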
The author hypothesized that scores on the authoritarian scale would be positively correlated with the tendency to convict, scores on the antiauthoritarian scale would be negatively correlated, and scores on the equalitarian scale not correlated. Indeed, this was the case in a sample of 151 college students who were presented with vignettes of a murder case. A handful of studies that have used this questionnaire with different types of subjects and different experimental designs have generally found support for the validity of the authoritarian subscale, mixed support for the antiauthoritarian scale, and absent or ambiguous results for the equalitarian scale (Kravitz, Cutler, & Brock, 1993).

Note a couple of "peculiar" aspects about this questionnaire. First, this is a forced-choice format, where the subject is "forced" to make a particular type of response. Second, the unselected items are scored 2 points. Third, because of the way the items are scored, the scales are not independent; if a person endorses the authoritarian items, for example, the scores on the other two subscales must perforce be lower.

Kravitz, Cutler, and Brock (1993) administered the LAQ, a revised version of the LAQ requiring a Likert-type response, and several other scales to a sample of undergraduate psychology students. One of their first findings was that the internal reliability (coefficient alpha) for the LAQ subscales and the revised version was "abysmal," ranging from .19 to .71, with six of the seven coefficients below .70. Despite such low reliability, the construct and concurrent validity seemed satisfactory. Authoritarian scores, for example, correlated with other measures of authoritarianism and with attitudes toward the death penalty, as well as with attending religious services and believing in a "just world" (i.e., that people deserve what happens to them). On the basis of these results, Kravitz, Cutler, and Brock (1993) dropped the items with low item-total correlations and obtained a 23-item scale with an internal reliability of .71. Several factor analyses of this 23-item form failed to indicate factors that were reliable and clearly interpretable. When compared with the other measures collected in the study, the authors concluded that the 23-item form had better concurrent validity but poorer discriminant validity. The 23-item version was then administered to a sample of adults. Internal consistency was now .83, but again the validity results were mixed: supportive of the authoritarian dimension, but not the other two. For another example of how a juror scale was constructed, in this case to measure the pretrial biases a juror may have, see Kassin and Wrightsman (1983).

Narby, Cutler, and Moran (1993) conducted a metaanalysis of studies that looked at authoritarianism and the juror's perception of defendant culpability. They did find a relationship, although the estimated average r was .16, a modest effect at best. The effect was a bit stronger when legal authoritarianism was considered vs. personality measures of authoritarianism.

LEGAL STANDARDS

What are the legal standards applicable to psychological testing? There are two such sets. One is the Federal Rules of Evidence (or FRE), which require mental-health professionals to offer evidence (e.g., results of psychological testing) that "is reasonably relied upon by such professionals." A second set is known as the Daubert standard: evidence presented by experts must be reliable (testable, subject to peer review, with a known rate of error, and/or generally accepted within the field) and relevant (applicable to the specific case and helpful to the fact finders in resolving the legal question).

LEGAL CASES

A number of legal acts and court cases have had substantial impact on the practice of psychological testing. We have discussed some of these in the context of testing children (Chapter 9) and testing in occupational settings (Chapter 14). Here we briefly mention a few of the better known ones.

Title VII of the Civil Rights Act of 1964

This act, commonly known as the Equal Employment Opportunity Act, is probably the best known piece of civil-rights legislation. This act created the Equal Employment Opportunity Commission, which eventually published guidelines concerning standards to be met in the construction and use of employment tests. Title
VII of this act makes discrimination against any individual on the basis of race, color, religion, sex, or national origin illegal. If testing is used as a condition for employment, then testing cannot discriminate. In essence, this act also dictates that tests should be prepared by experts and be validated according to the standards indicated in the Uniform Guidelines on Employee Selection Procedures.

It should be noted here that in-house tests (such as the integrity tests we discussed in Chapter 14) are usually more difficult to validate than professionally researched tests (such as the MMPI and the WAIS), and hence their use might be considered discriminatory. This act led to a number of court cases, of which the following are the most relevant:

Myart v. Motorola (1966). This was one of the first cases to focus on employment-testing discrimination. Mr. Myart was a black applicant for a job at a Motorola factory; he alleged that the qualifying test he was asked to take was discriminatory in that it required familiarity with white, middle-class culture. The Illinois Fair Employment Practices Commission agreed that the test was discriminatory, but the Illinois Supreme Court overturned the examiner's ruling.

Griggs v. Duke Power Company (1971). The hiring requirements of Duke Power Company included satisfactory scores on the Wonderlic Personnel Test and a mechanical aptitude test; these were challenged by blacks as arbitrary and not job-related. The trial court ruled that the tests did not violate Title VII, but the U.S. Supreme Court disagreed, and in fact criticized employment testing in general. The court ruled that broad, general testing devices could not be used, but that measures of the knowledge or skills required by a specific job could be used.

Albemarle Paper Company v. Moody (1975). In this case, the company also required satisfactory scores on two tests, but had hired a psychologist to conduct a validity study of the tests on their current employees. The Supreme Court concluded, however, that having a psychologist claim validity was not sufficient.

Connecticut v. Teal (1982). An employer required employees to pass a written test to qualify for promotion. The written test had not been validated, and the proportion of blacks to whites who passed the test was 68%. However, passing the test did not guarantee promotion, and a greater percentage of blacks than whites was promoted. The U.S. Supreme Court ruled, however, that the test still had to be validated and might be discriminatory.

Watson v. Fort Worth Bank and Trust (1988). Ms. Watson, a black woman, sued because she was denied a promotion based on work evaluations by white supervisors. Although both the district court and the appeals court dismissed the case, the U.S. Supreme Court accepted Watson's evidence that blacks received lower average performance ratings and fewer promotions than did whites, and concluded that subjective employment practices must meet the same objective standards as tests.

Daubert v. Merrell Dow Pharmaceuticals (1993). This case, along with the existing Federal Rules of Evidence, provided the Court with guidelines to assess the appropriateness, relevancy, and admissibility of scientific expert testimony (e.g., psychological testing). Specifically, it was recognized that the admissibility of expert testimony should be judged on its reliability (e.g., testability, subjection to peer review and publication, a known error rate, and/or general acceptance within the field) and relevancy to the legal issue (e.g., helpfulness to the trier-of-fact). A 1999 case (Kumho Tire Company v. Carmichael [1999]) noted that the "Daubert criteria" apply to the testimony of all experts, not just "scientists," and that the Court may consider one or more of these factors in its determination of admissibility.

Tests in Educational Settings

A number of court cases have had an impact on the use of tests in educational settings. The following are some of the better known examples:

DeFunis v. Odegaard (1971). Mr. DeFunis applied for admission to the University of Washington Law School, but was not admitted. He had an undergraduate GPA of 3.71, had been elected to Phi Beta Kappa, and had taken the Law School Admission Test three times with scores of 512,
566, and 668. He had also applied and had been admitted to two other law schools. He sued the University of Washington Law School (Odegaard was the president) on the grounds that other persons with lesser qualifications had been admitted. The Superior Court of the State of Washington held that the admission procedure was racially discriminatory and ordered that DeFunis be admitted to the law school. The Washington Supreme Court, however, reversed the judgment of the Superior Court. By the time several legal maneuvers had been played out, DeFunis was completing his degree and the case was declared "moot"; no decision was made.

Breland and Ironson (1976) present an interesting discussion of this case, considering the actual admission process used by the law school vs. what psychometric models would consider fair. It is interesting to note that the admission procedure used admitted 53% of selected ethnic groups, but only 15% of whites.

Debra P. v. Turlington (1981). Minimal competency legislation refers to individual state laws that typically result in a formal testing program designed to assess the minimum competencies that youngsters must have before being issued a high-school diploma or other educational endorsements. In 1976, the State of Florida passed the Educational Accountability Act, designed, in part, to ensure that the state's educational system in fact educated students, at the very least, to meet certain minimum competencies. What better way to assess this than through minimum competency testing? The result was a number of pro and con opinions (e.g., B. Lerner, 1981; Popham, 1981), and a number of legal challenges, including the court case of Debra P. The minimum competency test was challenged on grounds of racial bias (20% of black students failed the exam vs. only 2% of white students) and questionable "instructional validity" (i.e., was the test a fair representation of what was taught in the classroom?). This is not a closed issue, with some states having minimum competency tests and others facing legal challenges on the issue.

Truth-in-testing legislation. These laws give examinees the right to know more about examinations, particularly tests such as the SAT and the GRE. They cover such issues as privacy, information about the test's development, how test scores are to be used, and the right to examine individual test items.

Other Legislative Actions

In addition to the Civil Rights Act of 1964, there have been a number of major legislative actions that have impacted psychological testing. The major ones are:

Public Law 94–142 (1975). In 1975, Congress passed the Education for All Handicapped Children Act, known as Public Law 94–142, and more recently as the Individuals with Disabilities Education Act. This law is intended to ensure a free and appropriate public education, and all related educational services, to children with disabilities. Thus the law mandated that children with possible mental or physical handicaps are to be identified through the use of screening instruments and evaluated, and that a specific educational plan be designed to meet the child's special educational needs. This law has had a tremendous impact on assessment, and more specifically on psychological testing, because the law mandates, in part, that a child be assessed with "valid tests" that are culturally and racially appropriate. In 1990, a new amendment to this act replaced the term "handicapped" with the term "disability."

The Rehabilitation Act of 1973. This act protects qualified disabled individuals against discrimination in employment. Note first the term "qualified" – that is, the person must be capable of performing the essential functions of the job. Second, note that disability is a fairly broad term that includes conditions such as alcoholism and drug addiction (unless current drug use interferes with job performance), muscular dystrophy, epilepsy, cancer, tuberculosis, AIDS, and many more. This act not only protects but also requires affirmative action to employ qualified disabled individuals. One of the practical aspects of this act with regard to testing is that an applicant who contends that because of a disability he or she cannot demonstrate the required skills on a test, but argues that he or she can perform the job, must be given alternate means for demonstrating job suitability.

The Americans with Disabilities Act of 1990. This act prohibits all employers from
discriminating against disabled employees or job candidates; however, the federal government, Indian tribes, tax-exempt private membership clubs, and religious organizations are exempt. Here too, disabled is a rather broad category, although some specific conditions, such as bisexuality, compulsive gambling, and kleptomania, are excluded. This act requires that testing be conducted in a place and manner accessible to persons with disabilities – for example, testing with Braille for the visually impaired. One consequence is that tests used in employment selection must measure the skills essential to a particular job, rather than intelligence in general, and not reflect a person's disability. Thus an item on a personality test such as, "As a child I was sickly," would be in violation if that test were used for employment purposes.

The Civil Rights Act of 1991. This act was designed to facilitate worker lawsuits based on job discrimination. In terms of testing, the act prohibits the use of race norming and places employers in a quandary. On the one hand, employers can hire whomever they choose; if they use a test, they can hire the most qualified (i.e., highest-scoring) individuals. But given the differences in test scores between whites and minority members, such a procedure would have an adverse impact.

These various acts also resulted in a number of legal challenges and counterchallenges. Some illustrative examples are:

Gaines v. Monsanto (1983). A relatively new form of liability sustained by court decisions happens when an employer fails to exercise "reasonable care" in hiring an employee who subsequently commits a crime or does harm to others. In this case, a mailroom clerk who had a prior record of rape and conviction followed a secretary home and murdered her. The parents of the victim sued Monsanto for negligent hiring. Quite clearly, the employer needs to take preventive measures that might include additional testing, yet such testing may well be challenged as discriminatory.

Target Stores (1989). A class-action suit was filed, also in California, against Target Stores, which used the CPI and the MMPI to screen prospective security guards. The plaintiffs argued that the questions, such as "I am fascinated by fire," were not only bizarre but violated sexual, religious, and racial discrimination laws. The store settled without admitting legal wrongdoing.

Bilingual children. At least two court cases have focused on the misdiagnosis of bilingual children: Diana v. State Board of Education of California (1970) and Guadalupe v. Tempe Elementary District (1972). In both suits the argument was that improper assessment with IQ tests led to an overrepresentation of minority children in classes for the educable mentally retarded. The Diana case involved nine Spanish-speaking children who were placed in classrooms for the educable mentally retarded on the basis of their performance on tests of intelligence that were administered in English. These children had in fact been misdiagnosed. In the Diana case, the court directed that minority children be tested in both their native language and English and that cognitive assessment be carried out primarily with nonverbal tests. In Guadalupe, the court again directed that testing be carried out using the child's native language, but also that IQ could not be used as the only basis for making placement decisions about a minority child; adaptive behavior and/or other areas needed also to be assessed.

Larry P. v. Riles (1972). Larry P. was a black child "erroneously" classified as mentally retarded, while Wilson Riles was the Superintendent of Public Instruction in California. Because of this trial, intelligence tests were found to be biased against African American children, and the San Francisco public schools were prohibited from using such tests. This trial provided considerable impetus for the study of test bias.

In the Larry P. v. Riles (1972, 1974, 1979) case, the issue was again overrepresentation of minority children in classes for the educable mentally handicapped, but the focus was on black children. Much of the case revolved around the issue of whether the WISC-R was biased when used with black children, and the judge did find the test biased. In California, then, IQ testing of all black children was prohibited, regardless of what program the child was being considered for (e.g., gifted), and whether or not the parents gave permission. Interestingly enough, in a 1980 case just the opposite conclusion was reached. In Parents in Action on Special Education (PASE) v. Hannon, in the Chicago public school system,
the WISC-R and the Stanford-Binet were found not to be biased against black children.

Sharif v. New York State Education Department (1989). The N.Y. State Education Department used SAT scores to award state-funded scholarships. The American Civil Liberties Union brought suit charging sex discrimination because the SAT "underpredicts the academic ability of women." The judge agreed and, as a result, scholarships were awarded on the basis of both SAT scores and high-school GPA.

SUMMARY

We have taken a brief look at testing in the context of clinical and forensic settings. The examples chosen were illustrative rather than comprehensive. We devoted a substantial amount of attention to projective techniques; although these, as a group, seem to be less popular than they were years ago, they are, if nothing else, of historical importance and do illustrate a number of issues and challenges. Testing in the health psychology area is growing, and substantial headway is being made in applying psychological principles and techniques to the maintenance of good health and the understanding of how the human "body and psyche" actually function in unison. Testing is also increasing in forensic settings, while at the same time being regulated and challenged by a number of court decisions.

SUGGESTED READINGS

Aronow, E., Reznikoff, M., & Moreland, K. L. (1995). The Rorschach: Projective technique or psychometric test? Journal of Personality Assessment, 64, 213–228.

An excellent review of the idiographic-nomothetic and the perceptual-content approaches as they pertain to the Rorschach. The authors argue that the value of the Rorschach is in its being a projective technique rather than a psychometric test.

Blankstein, K. R., Flett, G. L., & Koledin, S. (1991). The Brief College Students' Hassles Scale: Development, validation, and relation with pessimism. Journal of College Student Development, 32, 258–264.

Presents a hassles scale developed specifically for college students. High scores on this scale are associated with a pessimistic outlook and more persistent symptoms of poor psychological adjustment.

Foster, G. D., & Wadden, T. A. (1997). Body image in obese women before, during, and after weight loss treatment. Health Psychology, 16, 226–229.

A major variable in the field of eating disorders is that of body image. Over the years, many scales have been developed to assess this dimension, from Rorschach responses to self-report inventories, from standardized body silhouettes to experimental laboratory procedures. This is a fairly representative study that uses two scales from a multidimensional questionnaire.

Kleinmuntz, B., & Szucko, J. J. (1984). Lie detection in ancient and modern times. American Psychologist, 39, 766–776.

The authors assume that the polygraph is basically a psychological test, although of questionable psychometric merit. They discuss the history of this technique, as well as a number of psychometric issues.

Lambert, N. M. (1981). Psychological evidence in Larry P. v. Wilson Riles. American Psychologist, 36, 937–952.

This article presents a brief history of this landmark case, and argues convincingly that the court's decisions were erroneous.

DISCUSSION QUESTIONS

1. Why use batteries like the Halstead-Reitan when brain functioning can now be assessed through a number of medical procedures?
2. You have been admitted to graduate school to pursue a doctorate in clinical psychology. You have the option to take a course on projective techniques. Would you, and why (or why not)?
3. You are testing an incarcerated 26-year-old male who has a history of sexual and physical aggression toward women. Which one projective technique would you choose and why?
4. If someone asked, "Is the Rorschach valid?" how would you answer?
5. Scales that measure authoritarianism appear to be useful for screening potential jurors in the voir dire process. What other scaled variables might be relevant?
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

PART FIVE: CHALLENGES TO TESTING

16 The Issue of Faking

AIM In this chapter we begin by asking, “What can interfere with the validity of a
particular test score? If Rebecca scores at the 95th percentile on a test of vocabulary,
can we really conclude that she does possess a high level of vocabulary?” While there
are many issues that can affect our conclusion, in this chapter we focus mainly on
just one – faking. We take a brief look at two additional issues – test anxiety and
testwiseness. We use a variety of tests, such as the MMPI and CPI, to illustrate some
basic points; you have met all of these tests in earlier chapters.

and some don’t) but tends to reduce validity.


SOME BASIC ISSUES
Under this label, Cronbach listed several response
Response sets. As discussed in Chapter 2, a test sets that others have subsequently considered as
is typically constructed from an appropriate item separate.
pool by selecting the items that meet certain ratio- 2. Definition of judgment categories. Most
nal and/or empirical and statistical criteria. Basi- tests require the subject to respond using given
cally, what a test measures is determined by the response categories, such as the Likert response
content of the items, yet the final score for a per- scale. But different subjects give different mean-
son reflects not only the item content, but also ings to the response options, e.g., the mean-
the item and response formats – aspects which ing assigned to such response categories as
Cronbach defined as response sets. In a pioneer- “frequently”: Does that mean every day? Six times
ing article, Cronbach (1946) defined response a day? Once a week?
sets as any tendency that might cause a person 3. Inclusiveness. When the subject can make
to consistently give different responses to test as many responses as he or she likes, some indi-
items than he or she would have given if the same viduals make more responses than others. This
content was presented in a different form. He occurs not only on essay exams, where one per-
then listed a number of such response sets that son’s answer may be substantially longer, but also
included: on tests such as the Rorschach, where one per-
1. The tendency to gamble. Cronbach son may see many more percepts on an inkblot,
described this as the tendency to respond to an or the Adjective Check List, where one person
item when doubtful, to omit responses, to be cau- may endorse substantially more items as self-
tious, to answer with a neutral response alterna- descriptive.
tive rather than an extreme alternative (as in Lik- 4. Bias or acquiescence. This is the tendency to
ert scales choosing “unsure” rather than strongly endorse “true” or “yes” to dichotomous response
agree or strongly disagree). Such guessing can items. Such response tendencies affect an answer
increase the reliability of a test because it results only when the student is to some degree uncertain
in greater individual differences (some guess about the item content. Thus, acquiescence tends

427
to make false items more valid and true items less valid.

5. Speed vs. accuracy. Where speed of response is an important element, the respondent can answer carefully, sacrificing speed, or can answer rapidly, sacrificing accuracy.

6. Response sets on essay tests. Cronbach argued that there were many such response sets, involving how organized, fluent, detail oriented, etc., the person might be.

Cronbach argued that there were individual differences in these response sets (e.g., some people are more willing to guess if unsure than others are), and that these individual differences were reliable. He pointed out that response sets have the greatest impact on a test score when the items are ambiguous or unstructured (another reason to stay away from essay-type items). In fact, Cronbach (1946; 1950) suggested the use of multiple-choice items as a way to reduce response sets. Since that time, a number of these response sets (sometimes also called response biases) have been studied (see Broen & Wirt, 1958, for a listing of 11 such sets), including primarily the following:

1. social desirability bias (or faking good)
2. faking bad
3. acquiescence (or yea-saying)
4. opposition (or nay-saying)
5. positional set (extremity and mid-point response sets)
6. random responding (carelessness and/or inconsistency)

These response sets are seen as potential threats to the reliability and validity of self-report measures, and a number of efforts have been made to deal with such biases. Three such ways are:

1. Have one or more measures of response bias incorporated in the self-report measure – this is illustrated by the CPI and the MMPI, among others.
2. Compare (typically correlate) the results of a self-report measure with a measure of such bias. For example, if we were to administer a self-esteem measure, we might also include a response-bias scale as part of our administration.
3. Determine how susceptible a scale is to faking, typically by having subjects complete the task under different directions (e.g., standard directions vs. fake bad), and see to what degree the responses differ.

Note, however, that such response styles not only can confound test results, but also represent legitimate dimensions that may reflect personality traits. A person who is more likely to guess on a multiple-choice exam may in real life be more of a gambler, willing to take risks, perhaps more impulsive, and so on. Thus, these two aspects – error variance in content scores and reliable variance for "stylistic" personality measures – represent two separate domains. These domains can be assessed independently by a variety of techniques such as factor analysis (see Messick, 1962).

Cognitive styles. There are consistent individual differences in the way different persons typically perceive, think, remember, solve problems, or perform other intellectual operations. These have typically been called cognitive styles, and a substantial number of them have been identified. For example, in decision making, some individuals prefer to take risks while others prefer to be cautious. In the area of memory, some individuals are "levelers" and others are "sharpeners," i.e., they either assimilate or exaggerate stimulus differences. If you ask a leveler what they did on their vacation, they'll reply, "I went to the beach"; if you ask a sharpener, they will give you all the details.

Although such dimensions are considered "cognitive," they in fact involve not only cognition, but personality, affect, temperament, etc., as well as interpersonal domains. Research on some cognitive styles, such as field dependence vs. field independence, or reflection vs. impulsivity, has a fairly substantial base; other cognitive styles have not been investigated to any degree.

Obviously, these styles can interact with test performance, yet at the same time they probably reflect the person's behavior. For example, an impulsive individual may do poorly on a multiple-choice test by not considering the alternatives carefully – but in fact, that's how he or she may well behave in work situations.

We will not consider cognitive styles further, although they are very important, and a number of scales have been developed to assess specific cognitive styles. A consideration of these would simply take us too far afield. For a recent overview
Faking. There are many types of faking and different sets of terms. One such term is malingering, which is seen as intentional faking to obtain some external incentive such as monetary compensation, avoiding punishment for criminal behavior, or avoiding military duty (APA, 1987). Faking basically refers to deliberate systematic distortion of the responses given to test items because the respondent wishes to create a particular impression. This impression may be made up of two components: an emphasis on socially desirable characteristics and/or a denial of negative characteristics. Various terms are used in the literature to discuss this topic, including "impression management" (e.g., Edwards, 1970) and "response bias" (e.g., Orvik, 1972).

Rogers (1984) distinguishes four patterns of responding to questionnaire items: (1) honest responding, where a sincere attempt is made to be accurate in responding; (2) irrelevant responding, where the response given is not relevant to item content, such as answering randomly; (3) defensiveness, where there is conscious denial or minimization; and (4) malingering, which is conscious fabrication or exaggeration.

Incidence of faking. When we instruct subjects to take a test first under standard conditions and then under instructions to fake, we are asking, "Can a test be faked?" A separate question, however, is, "Does faking occur in practice?" The first question has received considerable attention, and the answer seems to be a rather strong "yes." The second question has received much less attention, and the answer seems to be "it depends."

When faking or cheating does take place, we ordinarily do not find out about it, so it is difficult, if not impossible, to estimate its incidence. Furthermore, the incidence varies depending upon a number of variables, such as the type of test, the rewards and punishments associated with cheating, the procedures in place to eliminate or detect cheating, and so on.

Some writers feel faking is quite rare (e.g., Bigler, 1986), while others feel it is quite common (e.g., R. K. Heaton, Smith, Lehman, et al., 1978). A literature review cited incidence estimates ranging from 1% to 50% (P. I. Resnick, 1988). Binder (1992) argues that the only sensible procedure for the clinician is to consider the possibility of malingering on neuropsychological exams in every patient when there is a monetary or external incentive. The usual premise is that faking is commonplace, but the evidence for such faking is usually indirect.

Lanning (1989) studied CPI archives (at the University of California Berkeley's Institute of Personality Assessment and Research) and reported the incidence of faking good as ranging from 0%, in various samples studied, to 10.7%, in a sample of male police applicants. The protocols of psychiatric patients were most likely to be identified as fake bad (11.8% in one sample) and random (7.3%). Male prison inmates yielded a 2% frequency of fake bad, and male military academy students gave a 6.5% frequency of random responses. Thus, the prevalence of invalidity reflects the circumstances of test administration as well as sample characteristics.

It is estimated that the base rate for malingering is 33% among mild head-trauma patients who sought financial compensation (Binder, 1993), and as high as 77% among accused criminals who sought a finding of mental incompetence (Frederick, Sarfaty, Johnston, et al., 1994). As another indicator, we may consider the finding that between 40% and 90% of all college students cheat (K. M. May & Loyd, 1994).

At least under a number of circumstances, however, faking is an infrequent phenomenon. As we discussed in Chapter 3, detecting infrequent events – that is, those with a low base rate – is a difficult problem. Nevertheless, the literature is replete with successful studies, with success being greater when sophisticated analyses such as regression equations are used (Lanning, 1989). Certainly, the research shows that subjects, when instructed to do so, are quite capable of faking good (e.g., Dunnette, McCartney, Carlson, et al., 1962; Hough, Eaton, Dunnette, et al., 1990). Most faking that occurs "naturally" is probably fairly unsophisticated and detectable. Many years ago, Osipov (1944) argued that every malingerer is an actor portraying his or her interpretation of an illness and that, in assuming this role, the malingerer goes to extremes, believing that the more eccentric the responses given, the more disordered he or she will be judged to be.
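These base rates matter for practice. As a hypothetical illustration (the figures below are invented for the example, not taken from the studies cited), consider an index that correctly flags 90% of faked protocols and correctly clears 90% of honest ones. Bayes' theorem shows how much the meaning of a "flag" depends on the base rate:

    # Hypothetical sketch: how the base rate changes what a "flagged"
    # protocol means, for an index with 90% sensitivity and 90% specificity.

    def positive_predictive_value(base_rate, sensitivity, specificity):
        """Probability that a flagged protocol is actually faked."""
        true_pos = base_rate * sensitivity
        false_pos = (1 - base_rate) * (1 - specificity)
        return true_pos / (true_pos + false_pos)

    for base_rate in (0.01, 0.10, 0.50):
        ppv = positive_predictive_value(base_rate, 0.90, 0.90)
        print(f"base rate {base_rate:.0%}: P(faked | flagged) = {ppv:.2f}")

At a 1% base rate, fewer than one flagged protocol in ten is actually faked; at a 50% base rate, nine in ten are. This is the low base-rate problem of Chapter 3 in numerical form.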
Legal issues. The topic of faking is particularly relevant to a number of legal issues (H. V. Hall & Pritchard, 1996). For example, the malingering of psychosis is of special concern, as a defendant may be found legally insane or incompetent to stand trial based on such a diagnosis. However, there is little empirical research on the prevalence of malingering in a criminal forensic population, in part because of the criterion problem – what are we to use as a standard of "truth" against which we match our test results?

Cornell and Hawk (1989) found an incidence of diagnosed malingering of psychosis of 8% (25 cases out of 314) in criminal defendants who were referred for evaluation of competency to stand trial and/or insanity. Malingerers did differ from genuine psychotics on a number of variables which, when placed in a discriminant function analysis, correctly classified 89.1% of the cases. No tests were used in this study, although many of the symptoms, such as disturbed affect and delusions, could be easily assessed by a test such as the MMPI.

Another issue is that, traditionally, clinicians have worked for the client; if testing is administered, it is for the joint effort of clinician and client to help the client. However, in forensic situations the clinician often assumes a neutral role that may be perceived as adversarial by the defendant – i.e., the client is being tested to determine insanity, not necessarily because the client is to be helped, but because such an evaluation is mandated by the court or the legal proceedings.

Lack of insight. In discussing faking, three major issues are of concern: (1) motivation to distort the results in a particular way, such as faking mental illness or attempting to look more positive than one really is; (2) random responding in a conscious effort to sabotage the testing situation; and (3) inaccurate reporting of one's abilities, beliefs, etc., through lack of self-insight. Although all three are major issues, much of the literature and research efforts focus on the first issue.

Demand characteristics. Although much of the focus in this chapter is on faking, faking can be seen as part of a broader question that has to do with "demand" characteristics – i.e., are there aspects of the testing situation, such as the instructions, that do influence a person's response?

Content vs. style. In the area of personality assessment, we can distinguish, at least theoretically, between what a person says or does (content) and how a person acts (style). In reality, the what and the how are often intertwined and integrated. If we now apply this distinction to how a person responds on a personality inventory, we find that people can respond to the content of an item, or to what is considered a "response set," some aspect of the form of the item – for example, they may tend to answer true regardless of content.

In most applications of personality inventories, we assume that the responses reflect content – i.e., if the scale is one of responsibility, we assume that a high score reflects a high degree of responsibility. In constructing such a scale, however, we take appropriate steps to reduce, eliminate, or control for such stylistic patterns. For example, we would try to have half of our scale items keyed true and half keyed false. Some investigators have argued that test results are heavily influenced by such stylistic approaches; D. N. Jackson and Messick (1958), for example, argued that acquiescence was a major source of variance in the CPI.

Set vs. style. Rorer (1965) distinguished between "set," which refers to a conscious or unconscious desire to respond in such a way as to create a certain image (e.g., to fake good), vs. "style," which refers to a tendency to select some response category a disproportionate amount of the time, independent of the item content. Acquiescence, for example, can be both, in that it can refer to a general tendency to be "agreeably passive" or to a preference for endorsing "true" or "agree" categories. Others (e.g., D. N. Jackson & Messick, 1958; McGee, 1967) have defined the two somewhat differently, but much of the literature uses these terms as synonyms.

Is response set error? There are at least two points of view here. Some writers think that response sets in fact represent meaningful dimensions of behavior to be assessed (e.g., Cronbach, 1946; 1950), while others think that response sets need to be corrected for or eliminated from tests (e.g., A. L. Edwards, 1957b; Fricke, 1956). Just considering validity, there are also two perspectives. From the viewpoint of predictive validity, it can be argued that if response set increases the
predictive validity, then it is not error. Such an example occurs on the California F (fascist) Scale, where content of the items and acquiescence of response were confounded (Gage, Leavitt, & Stone, 1957). From the viewpoint of construct validity, such confounding is error, because it interferes with our understanding of the underlying nature of the construct (Messick & Jackson, 1958).

Faking is more than faking. In many instances, scales to assess faking can also yield valuable information in their own right. For example, on the CPI the three validity scales can also be useful in interpreting the individual's personality structure and dynamics. As an example, in high-school males (but not females), random answering on the CPI is related to a lower probability of going on to college, a lower GPA, and a greater likelihood of being perceived as delinquent (Lanning, 1989).

Faking good and faking bad. Traditionally, faking good and faking bad were thought to represent opposite poles of a unitary dimension, but more recently the two concepts are seen as different dimensions to be assessed separately. Faking good is seen as composed of two independent concepts, typically labeled "self-deceptive enhancement" and "impression management." Are there also two components in faking bad? At present, the answer is unclear, but the two major scales in this area, the MMPI F scale and the CPI Sense of Well-being scale, seem to reflect two different approaches. In general, the detection of faking good seems to be more difficult than the detection of faking bad (R. L. Greene, 1988).

How does one fake good? Basically, by endorsing test items that portray personal honesty and virtue. The MMPI L scale was developed to detect this strategy. Another strategy of the test taker is to overendorse items indicative of especially good adjustment or mental health. The MMPI Mp (positive malingering) scale was developed to detect this strategy (Cofer, Chance, & Judson, 1949). Faking good is often manifested by a failure to acknowledge commonly held weaknesses and endorsement of a constellation of unusual virtues. If one considers self-deception vs. other-deception, faking good focuses more on other-deception – that is, the person who fakes good knows that he or she is making incorrect claims.

How does a test taker fake bad? One strategy is to overendorse symptoms. Scales such as the MMPI F scale and the F-K index attempt to detect this strategy. Another strategy is to endorse specific symptoms that represent the respondent's concept of what mental illness is all about. The CPI Sense of Well-being scale (originally called the Dissimulation scale) was developed to detect such a strategy (Gough, 1954). Faking bad, at least on personality tests, reflects a desire to appear poorly adjusted, perhaps mentally ill. Faking bad may represent a cry for help, a negativistic stance, or a manipulative attempt to gain a particular goal (for example, ending up in a psychiatric hospital rather than prison).

Most scales developed to measure deception, particularly the MMPI scales, were developed empirically and/or rationally, but are heterogeneous in content, like most other personality scales developed empirically. For example, a fake-good scale will typically have both items that represent excessive virtue and items that involve the endorsement of superior adjustment.

Random responding. Random responding also needs to be considered. This may be the result of an honest error, such as placing answers incorrectly on an answer sheet; lack of understanding, as with a person of borderline intelligence; or willful behavior, as with someone who is passive-aggressive and does not wish to answer a personality inventory.

Personality tests. There are probably three major ways in which personality test scores may be distorted: (1) deliberate faking, (2) an idealized presentation of oneself as opposed to a more realistic presentation, and (3) an inaccurate presentation because of lack of insight. These aspects involve habits and attitudes. Habits are usually more focused on the mechanics of the test – for example, a person may have the habit of selecting the first plausible answer in a multiple-choice item, rather than closely considering all the alternatives. Attitudes, as we saw in Chapter 6, are broader. In terms of testing, such habits and attitudes that may influence test scores are subsumed under the term "response sets."
SOME PSYCHOMETRIC ISSUES

Scale-development strategies. Most scales to detect faking have been developed using one of three strategies:

1. One group – one instruction. Here, an item pool is administered to one group, usually normal individuals, under standard instructions, and the rate of endorsement for each item is calculated. For example, we might find that only 7% endorse the item, "I am extremely honest in all my dealings." We might then form a scale of such items with low endorsements. High scores on such a scale would tend to reflect unusual claims.
2. One group – two instructions. Here a group of subjects, often captive college students, takes the item pool under standard instructions, and then retakes it under faking instructions. The instructions may be quite generic (e.g., fake good) or much more specific (e.g., answer as if you were a chronic alcoholic). Items that show significant response shifts under the two sets of instructions are retained for the faking scale. For example, if the item "I am a wonderful person" is endorsed as true by 16% of the sample under standard instructions, but by 62% under the faking instructions, then that item would be retained for the scale (a computational sketch of this screening step follows the list).
3. Two groups – two instructions. Here a group that is "deviant," such as psychiatric patients, is administered the item pool under standard instructions. A second group, usually normal, is asked to fake as if they were psychiatric patients. Items that show differential endorsement are retained. For example, the item "I hear voices" might be endorsed by 28% of psychiatric patients, and by 86% of normal individuals instructed to fake mental illness; that item would be retained for the faking scale.
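The screening step in the second strategy can be expressed in a few lines of code. The items, percentages, and sample sizes below are invented for illustration; the decision rule retains an item when the shift in endorsement between the two administrations is statistically significant (here, a two-proportion z-test at the .05 level):

    # Hypothetical sketch of strategy 2: keep items whose endorsement rate
    # shifts significantly between standard and faking instructions.
    from math import sqrt

    def shift_z(p_std, p_fake, n_std, n_fake):
        """Two-proportion z statistic for the change in endorsement rate."""
        pooled = (p_std * n_std + p_fake * n_fake) / (n_std + n_fake)
        se = sqrt(pooled * (1 - pooled) * (1 / n_std + 1 / n_fake))
        return (p_fake - p_std) / se

    # (item text, proportion "true" under standard vs. faking instructions)
    items = [
        ("I am a wonderful person", 0.16, 0.62),
        ("I enjoy reading",         0.55, 0.58),
        ("I never make mistakes",   0.05, 0.40),
    ]

    n = 100  # hypothetical subjects per administration
    faking_scale = [text for text, p_std, p_fake in items
                    if abs(shift_z(p_std, p_fake, n, n)) > 1.96]
    print(faking_scale)   # the stable second item is dropped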
Faking psychopathology. To develop procedures to detect the faking of psychopathology, the task is to find ways to distinguish between persons who actually have psychopathology and persons who pretend that they do. Dannenbaum and Lanyon (1993) indicate that two types of items are relevant: Type 1 items have predictive validity but low or no face validity; Type 2 items have face validity but no predictive validity.

Type 1 items might be like the "subtle" MMPI items (except for the fact that research indicates the MMPI subtle scales have no predictive validity). An item that illustrates this type comes from studies of deception in interviews; the item "disturbed affect – flat/blunted or inappropriate" is in fact characteristic of most psychotics, but has a low endorsement rate for malingerers (Cornell & Hawk, 1989).

Type 2 items are endorsed by malingerers, but not by psychiatric patients. An example is the item "visual hallucinations." Few psychotic patients actually report such symptoms, but many malingerers endorse this item.

One approach, then, is to empirically identify sets of items that statistically discriminate between individuals who are truly mentally ill and individuals who are instructed to answer as if they were suffering from mental illness. Thus, if only 2% of mentally ill persons respond true to the item "I hear voices," but 86% of normals instructed to fake bad do so, then the item would be an excellent candidate for a fake-bad scale. If an individual we are testing endorses the item, the probabilities are quite low that the person is truly mentally ill, but quite high that the person is faking.

Generic vs. specific scales. One approach is to develop a scale (or scales) that are "generic." Thus, a fake-bad scale should detect faking bad wherever it might occur. This is the implicit belief that many researchers have when they include an MMPI validity scale as part of their battery of questionnaires.

A second approach lies in developing scales that are specific to a content area. For example, the three factors identified in the Timmons, Lanyon, Almer, et al. (1993) study are germane to the detection of malingering on a sentence completion test used to examine claims of disability.

Scales used as a correction. Validity scales, such as those on the MMPI, were developed primarily for the purpose of identifying suspect protocols, that is, respondents who may have deliberately distorted their answers. Occasionally, validity scales are also used to "correct" the scores on the other scales. Christiansen, Goffin, Johnston, et al. (1994) studied the 16 PF, where fake-good and fake-bad scales can be used to add or subtract points to the other scales. These corrections basically treat the faking scales as suppressor variables.
A suppressor variable is one that removes variance that is assumed to be irrelevant, so as to accentuate the relationship between the predictor scale and the criterion; that is, a suppressor variable is significantly associated with a predictor, but not associated with the criterion for which the predictor is valid. For example, if we have a personality scale that predicts leadership behavior well, we have a predictor (the scale) that correlates well with the criterion (leadership behavior, however defined). Suppose, however, that our scale is heavily influenced by acquiescence – i.e., our scale correlates with acquiescence, but the criterion does not. If we can remove the acquiescence, then the correlation between our scale and the criterion should increase. Generally, if we can identify and measure the suppressor variable, then the validity of our predictor against our criterion should increase. The Christiansen, Goffin, Johnston, et al. (1994) study of assessment-center candidates indicated that correction for faking had little effect on criterion-related validity (performance data based on job analyses), but would have resulted in different hiring decisions than those made on the basis of uncorrected scores. These authors concluded that faking is not a serious threat to the validity of personality tests and that the use of faking-corrected scores may be unwarranted.
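The suppressor logic is easy to demonstrate with simulated data. In the sketch below (entirely hypothetical numbers), a predictor scale is contaminated by acquiescence that is unrelated to the criterion; residualizing the predictor on the acquiescence measure raises the validity coefficient, which is exactly the suppressor effect described above:

    # Hypothetical simulation of a suppressor effect: removing acquiescence
    # variance from a predictor raises its correlation with the criterion.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    trait = rng.normal(size=n)                      # what the scale targets
    acquiescence = rng.normal(size=n)               # stylistic contamination
    criterion = trait + rng.normal(size=n)          # criterion behavior
    predictor = trait + acquiescence + rng.normal(scale=0.5, size=n)

    def r(x, y):
        return np.corrcoef(x, y)[0, 1]

    # Residualize the predictor on acquiescence (the suppression step).
    slope = np.polyfit(acquiescence, predictor, 1)[0]
    corrected = predictor - slope * acquiescence

    print(f"validity before correction: {r(predictor, criterion):.2f}")
    print(f"validity after correction:  {r(corrected, criterion):.2f}")

With these particular parameters the validity coefficient rises from about .47 to about .63: acquiescence correlates with the predictor but not with the criterion, so partialling it out sharpens the prediction.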
in a sample of American college students scores
Oblong trapezoids. A rather different and on the scale predict the criterion quite well, but
potentially useful approach is illustrated by in a sample of Chinese students the relationship
Beaber, Marston, Michelli, et al. (1985) who is minimal at best. The variable of culture would
developed a test to measure malingering in moderate the relationship between the scale and
schizophrenic persons. The test consists of three the criterion. Similarly then, we might find that
subscales, including a malingering subscale com- CPI scales predict well in subjects who do not
posed of beliefs that cannot be true because they respond in a stylistic manner (i.e., who score low
are nonexistent (e.g., “God has revealed to me the on such scales), but that the relationship is atten-
truth about oblong trapezoids”) or that present uated (i.e., lowered) for subjects who do respond
atypical hallucinations and delusions (e.g., “I see stylistically (i.e., who score higher on such scales).
colored triangles in my field of vision”). The test In fact, L. R. Goldberg, Rorer, and Greene (1970)
identified 87% of true negatives (schizophrenic undertook such an analysis with the CPI and
patients) and 78% true positives (normal indi- found that the predictive validity of the standard
viduals instructed to malinger). However in two CPI scales was not increased by using any of 13
cross-validations, the test did not work as well stylistic scales as either suppressor or moderator
and a revision was proposed (Rogers, Bagby, & variables.
Gillis, 1992).
Detection of faking. From a statistical point of
Stylistic scales. How useful are stylistic scales, view, there are basically two methods to detect
that is, personality scales that attempt to mea- faking. When we ask a sample of subjects to take
sure one’s personality style, such as impulsivity? an inventory first under standard directions and
Detection of faking. From a statistical point of view, there are basically two methods to detect faking. When we ask a sample of subjects to take an inventory first under standard directions and then a second time by faking, we can compare the mean differences in scores obtained under the two conditions. If there are constant discrepancies across subjects attributable to faking, then the difference between means should be statistically significant.

From a research point of view this is a useful and well-used design, but the procedure doesn't tell us anything about individual differences in faking. To address this issue, we need to look at the variance (i.e., SD²) of the differences between scores, or we need to correlate the scores obtained under the two conditions. If there are individual differences in faking, then the rank order of individuals should change over the two conditions (see Gordon & Gross, 1978; Lautenschlager, 1986).
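Both analyses can be sketched directly. With hypothetical scores for the same subjects under the two conditions, a paired t-test addresses the first question (did the group as a whole shift?) and the cross-condition correlation addresses the second (did individuals change rank order?):

    # Hypothetical sketch of the two statistical methods for detecting faking.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    trait = rng.normal(50, 10, size=60)             # subjects' "true" scores
    standard = trait + rng.normal(0, 3, size=60)    # standard directions
    faked = trait + 8 + rng.normal(0, 6, size=60)   # faking shifts scores,
                                                    # by different amounts
                                                    # for different people
    t, p = stats.ttest_rel(faked, standard)
    r_xc, _ = stats.pearsonr(standard, faked)

    print(f"mean shift = {np.mean(faked - standard):.1f} (t = {t:.1f}, p = {p:.2g})")
    print(f"cross-condition correlation r = {r_xc:.2f}")

A significant t shows only that the sample as a whole moved. If everyone faked by the same constant, the correlation would equal the test-retest reliability; a correlation clearly below that value means the rank order changed, i.e., that subjects differ in how much they fake.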
vent faking or distortion of responses. At least
TECHNIQUES TO DISCOURAGE FAKING in the case of Rotter’s Interpersonal Trust Scale,
the innocuous title and filler items seem to have
Intentional distortion. When instructed, people no significant impact on its reliability and valid-
are able to distort their responses on personality ity (Kumar, Rehill, Treadwell, et al., 1986). Given
tests in the desired direction. Several approaches that filler items can be difficult to write, increase
have been used to deal with such intentional the length of a test, and make scoring of the test
distortion: somewhat more cumbersome, research needs to
1. Instructions or warnings that distortion can be undertaken on the usefulness of this approach.
be detected and/or punishment will follow (e.g.,
Schrader & Osburn, 1977). Such warnings do
seem to reduce the amount of intentional distor- Forced-choice format. One of the problems
tion, but the research support is limited. There with a forced-choice format is that it results in
is evidence that indicates that different types of lowered reliability. If a response set is operating,
instructions do result in different amounts of fak- the result will be increased reliability – the person
ing (e.g., Longstaff & Jurgensen, 1953). who picks “true” as the answer will continue to do
2. Use of forced-choice items that are equated so across test items and/or test administrations. If
on social desirability. Such scales are however, we now eliminate this response set, reliability will
still open to fakability, and the use of such items also be reduced. Most of the time, this is of little
does not seem to be the solution (e.g., Longstaff concern because the reduction will be minor (not
& Jurgensen, 1953; Waters, 1965). all people respond true all of the time), and there
3. The use of subtle vs. obvious items, i.e., may well be an increase in construct validity –
items for which the underlying construct is not i.e., we are measuring more of what the test really
apparent. is measuring. If however, the reduction in relia-
4. Use of validity scales. Most personality bility is substantial, then with low reliability we
inventories contain such scales; for example, the may not be able to achieve validity. This is in fact a
MMPI, CPI, and Personality Research form dis- criticism that was leveled at the EPPS, which uses
cussed previously all have such scales. forced-choice items to control for social desir-
ability (Levonian, Comrey, Levy, et al., 1959).
Disguised titles. Individual scales, as opposed
to multivariate instruments such as the MMPI Developing faking scales. There are two basic
or CPI, sometimes do use disguised titles so as approaches that parallel the approaches in devel-
not to establish in the client a “set” that might oping any personality scales. The first is the
lead to distortion of responses. For example, empirical approach, where the items chosen for
a scale are chosen because they show differential endorsement.

The second is the rational approach. Here the items that are to comprise a potential malingering scale are selected on the basis of their content – for example, items that reflect highly improbable behavior (e.g., "I read every editorial column in the paper every day"), claims of exceptional virtue ("I have never lied in my life"), or nonexistent behaviors ("I am one of the few persons who understands the concept of relative reciprocal tropism"). The items are then tested empirically to see if, in fact, they work.

Cross-cultural perspective. Several studies have looked at cross-cultural differences in response styles. They have been able to document a preference for choosing the extreme categories in a response scale (extreme response set) or for choosing a "yes" response (acquiescence) among ethnic or racial minorities such as Hispanics and African Americans in the United States (e.g., Bachman & O'Malley, 1984; Hui & Triandis, 1989).

Marin, Gamba, and Marin (1992) analyzed the responses given to four large data sets (e.g., questions answered in a study of variables associated with cigarette smoking) by Hispanics and non-Hispanic whites. Hispanics preferred extreme responses to a greater degree (e.g., selecting strongly agree or strongly disagree on Likert response scales) and answered more items in an acquiescent manner (i.e., endorsing extreme agreement such as "very likely"). Such response styles were related to acculturation and to education, so that the more acculturated and more highly educated Hispanics tended to make less extreme or acquiescent responses.

In general, Latinos obtain higher average scores on social desirability scales than do Euro-Americans. One possible explanation is the concept of "simpatia," a cultural value that focuses on the enhancement of smooth interpersonal relationships and the minimization of conflict (e.g., Booth-Kewley, Rosenfeld, & Edwards, 1992).

Shultz and Chavez (1994) administered an 11-item social desirability scale to a large sample of job applicants for an unskilled manual-labor position. Some 1,900 applicants completed an English form of this scale, and some 600 used the Spanish form. Not only was the mean on the English version significantly lower than that for the Spanish version, but a factor analysis indicated that the factor structures were somewhat different for the two versions. For the English version, two factors were obtained: a factor of "impression management" that reflected overreporting of desirable behaviors and underreporting of undesirable behaviors, and a factor of "self-deceptive enhancement," reflecting a belief that one is better than he or she really is. In the Spanish version, these two factors also appeared, as well as an additional two factors (both composed of too few items to consider here).

Symptom validity testing. This approach involves the presentation of repeated two-alternative, forced-choice discrimination problems (e.g., Binder & Willis, 1991). The probability of a given outcome, assuming no knowledge of the correct responses, will conform to a binomial distribution. If one avoids giving correct answers (malingers), the score obtained will be markedly below chance. This is akin to having a blindfolded person identify the way a coin lands over a large number of coin tosses. If the person is truly blindfolded, their guesses should be correct about half of the time. If they are peeking through the blindfold, their responses will show a greater degree of correctness than we would expect on the basis of the binomial distribution. Similarly, if they are being negativistic and say "heads" when they peek and see "tails," in the long run their response correctness will be below chance level.
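The binomial arithmetic behind symptom validity testing is straightforward. A hypothetical sketch (the number of problems and the obtained score are invented):

    # Symptom validity testing: how unlikely is a given score on n
    # two-alternative problems if the person is merely guessing?
    from scipy.stats import binom

    n_items = 72     # forced-choice problems presented (hypothetical)
    score = 25       # number answered correctly (hypothetical)

    # Probability of doing this badly or worse by chance alone.
    p_below = binom.cdf(score, n_items, 0.5)
    print(f"P(score <= {score} | guessing) = {p_below:.4f}")

Chance performance on 72 problems centers on 36 correct; a score of 25 yields p < .01 under guessing, so performance that far below chance suggests the person recognized the correct answers and systematically avoided them.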
RELATED ISSUES

Does format alter scores? In most personality inventories that assess many variables, the scale items are listed randomly rather than grouped together. Why randomize items? Presumably such randomization reduces biases such as social desirability. However, there are some disadvantages. If a subject is answering a questionnaire whose intent is not clear, there may be a lack of trust and less motivation to answer honestly. It can also be argued that shifting from one item in one domain to another in a different domain creates an intellectual demand that may not be met. There is some evidence available, but these issues have not been investigated thoroughly.

In the area of reliability, for example, Solomon and Kopelman (1984) looked at three different item-presentation modes for life-satisfaction scales and the hypothesized changes in reliability. They found that grouping items increased internal consistency somewhat, but grouping items and labeling the subscale resulted in very modest increases in reliability. In the area of validity, Baehr (1953) assessed the impact of grouping together vs. randomly distributing items on attitude scales and found that test format had no impact on the discriminant validity of the scales. Schriesheim and DeNisi (1980) and Schriesheim (1981) looked at leadership scales and found that grouping the items did impair discriminant validity.

Positional response bias. This bias refers to selecting one response position on multiple-choice tests significantly more often, regardless of the item content. The few studies that have been done on positional response bias are inconclusive because of various methodological problems, such as failure to randomize the position of the keyed responses. In one study with 62 university students, positional response bias was found in 6 students (about 10%), but degree of positional bias did not significantly correlate with test scores (Fagley, 1987).

Use of discriminant functions. Regression equations and discriminant functions (like regression equations, but predicting a categorical variable, such as neurotic vs. psychotic or faking vs. not faking) are very useful analyses to assess faking. Schretlen, Wilkins, VonGorp, and Bobholz (1992) developed a discriminant function from combined MMPI, Bender-Gestalt, and malingering scale scores that correctly identified 93.3% of their subjects, which included 20 prison inmates instructed to fake insanity and 40 nonfaking control subjects. The discriminant function included only three variables: the F – K index from the MMPI, a Vocabulary subtest of the malingering scale, and the sum of five Bender-Gestalt indicators of faking.
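The general technique is available in any statistics package. The sketch below is purely illustrative (the indicator scores are simulated, and the weights have nothing to do with the Schretlen et al. equation), but it shows the form of the analysis: fit linear weights to several indicators, then classify new protocols:

    # Hypothetical sketch of a discriminant function for detecting faking:
    # three simulated indicator scores, two groups (fakers vs. controls).
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(2)
    n = 200
    # Fakers run higher on each indicator (e.g., an F-K type index).
    fakers = rng.normal([12.0, 8.0, 3.0], 4.0, size=(n, 3))
    controls = rng.normal([2.0, 5.0, 1.0], 4.0, size=(n, 3))
    X = np.vstack([fakers, controls])
    y = np.array([1] * n + [0] * n)   # 1 = instructed to fake

    model = LinearDiscriminantAnalysis().fit(X, y)
    print("classification accuracy:", model.score(X, y))
    print("discriminant weights:", model.coef_.round(2))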
(1983) obtained the same results on the Halstead-
Dissimulation about neuroticism. One of the Reitan.
earliest studies of dissimulation was carried out Yet, consider the following: Faust, Hart, and
by Gough (1954), the author of the CPI. He began Guilmette (1988) tested three normal youngsters
with the observation that both lay persons and with the WISC-R and the Halstead-Reitan. The
children were instructed to "perform less well than usual but not to be so obvious" that their faking would be detected. Of the 42 clinical neuropsychologists who reviewed the test protocols, 93% diagnosed abnormality (from two given choices: normal or abnormal), and of these, 87% attributed the abnormality to cortical dysfunction (from three given choices: cortical dysfunction, malingering, or functional factors). No clinicians attributed the results to malingering.

Though each of these studies is limited, for now we must conclude that clinicians, when judging test protocols impressionistically, are not very good at detecting faking. The answer quite clearly is to use psychometric scales, signs, or indices to detect malingering, because their batting average is substantially higher.

How important are response sets? Distortion of self-report through response-style bias has long been recognized as a potential major source of systematic error in psychological testing, especially in the assessment of personality and of psychopathology. But how important are such response sets as social desirability and acquiescence? Dicken (1963) used the CPI and assessed the role of social desirability and acquiescence as suppressor variables; he found that, on the CPI, significant gains in validity by accounting for good impression and social desirability were rare and that no gain in validity resulted from suppressing acquiescence. He thus concluded that the importance of these variables in personality inventories may have been overemphasized.

Rorer (1965) reviewed much of the early literature on acquiescence as related to the MMPI and other measures and concluded that the inference that response styles are an important variable in personality inventories was simply not warranted. (For a very different conclusion, see D. N. Jackson & Messick, 1958.)

Some criticisms. Much of the research in this area is limited by a number of design restrictions. Many studies use college students rather than samples of individuals who may be more likely to malinger. Scores of subjects asked to fake are compared with scores of subjects answering under standard conditions, rather than with those of individuals who are genuinely disturbed. Typically, no incentive is provided to subjects instructed to fake, whereas in real-life situations such incentives may well be present.

How effective are instructions to fake? In most studies, instructions to fake are usually explicit, and typically indicate that the subject should do a "believable" job, that is, not be extreme and bizarre. Sometimes incentives are offered to "fool" the judge who might potentially analyze the results. The findings from such studies do indicate that subjects given different instructions do produce different results. For example, typical mean scores on the MMPI validity scales are shown in Table 16.1.

Table 16–1. Mean Scores on the MMPI Validity Scales under Different Instructions (Cassisi & Workman, 1992)

                          MMPI scales
Group instructed to:      L      F      K
Be honest                 50     54     48
Fake good                 70     49     52
Fake bad                  55    114     41

Note: Remember that these are T scores, where 50 is the expected mean and 10 is the standard deviation.

Note that these are T scores, and we would expect average scores to be about 50. That is exactly the case with the standard-instructions group. Note the elevation of the L scale for the fake-good group, and the elevation of the F scale for the fake-bad group.
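Because the table, and the MMPI decision rules discussed below, are expressed in T scores, it is worth recalling the transformation: the raw score is located relative to the normative mean and standard deviation. A minimal sketch, with made-up normative values:

    # T scores express a raw score on a scale with mean 50 and SD 10,
    # relative to the normative sample (normative values here are made up).
    def t_score(raw, norm_mean, norm_sd):
        return 50 + 10 * (raw - norm_mean) / norm_sd

    # A raw score of 11 on a scale normed at mean 6, SD 2.5:
    print(round(t_score(11, 6, 2.5)))   # 70, i.e., 2 SDs above the mean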
THE MMPI AND FAKING

General comments. The original MMPI included scales designed to identify subjects who might claim symptoms and problems they did not have (i.e., fake bad), claim positive characteristics they did not have (fake good), or deny symptoms and problems they really had (also fake good); these scales are collectively known as the "validity" scales (as opposed to the clinical scales).

Many MMPI studies have been carried out to determine the effectiveness of these validity scales. A typical design involves a "normal" group, such as college students, who are administered the test twice, first with regular instructions and then with instructions to fake in a specific direction. In most studies, the validity scales are able to differentiate between valid and invalid profiles, with typical accuracy rates of 80% to 98%.

Another common approach, which seems more appropriate, is to compare the MMPI protocols of individuals who have been diagnosed as mentally ill vs. the protocols of normal subjects who are asked to fake bad. The results here, although typically supportive of the usefulness of the MMPI validity scales and indices, are less impressive.

Validity scales. The term "validity scales" refers to a specific set of scales that are used to determine whether a specific administration of the test is valid (that is, appropriate or acceptable). The term is different from the use discussed in Chapter 3, yet in a sense it is the same – here we consider validity for the individual client rather than validity of the scale.

The MMPI contains three basic validity scales: the ? scale, the L scale, and the F scale. The ? score is simply the number of items omitted by the respondent; if too many items are not answered, the interpretation of the results is questionable. The L scale detects those who have probably lied in the fake-good direction. The F scale consists of items to which 90% or more of the normative group gave an identical answer; high scores on this scale reflect endorsement of rare answers, presumably the result of faking and/or random answering.

The Cannot Say or ? scale. Although ordinarily subjects are encouraged to answer all items on a questionnaire, the standard MMPI instructions indicate that items that do not apply can be omitted. Because omitting many items automatically lowers the raw scores on the other scales, it is important to make sure that most items have, in fact, been answered. J. R. Graham (1990) recommends that protocols with more than 30 items omitted not be interpreted. Raw scores on this scale, which reflect the number of items not answered, are changed to T scores, but the transformation is done arbitrarily rather than statistically. Thus raw scores below 30 are equated to a T score of 50, and raw scores of 110 are equated to a T score of 70.

Items can be omitted for a wide variety of reasons, discussed in detail in the MMPI Handbook (W. G. Dahlstrom, Welsh, & L. E. Dahlstrom, 1972). The Handbook suggests that if there is an elevated Cannot Say score, the subject be interviewed to determine the reason (e.g., suspiciousness, lack of understanding, depression, fear of loss of privacy, etc.).

The Lie (L) scale. This 15-item scale is designed to identify deliberate efforts to lie on the test. The items involve denial of aggression, prejudices, poor self-control, etc. – failings that in fact are rather common and that most people are willing to admit. All the keyed responses are false. High scores on the L scale represent a rather unsophisticated attempt to fake good, and so scores on this scale are related to socioeconomic level, intelligence, and education, with more sophisticated persons from higher socioeconomic levels scoring lower. High scorers on this scale tend to be rigid, overly conventional, socially conforming individuals who are moralistic and unoriginal.

The F scale. The earliest validity index on the MMPI was the F scale, designed by the authors to detect deviant response sets. The F scale was derived from a set of 64 items endorsed by less than 10% of the normal normative group. The keyed responses are the infrequent answers; a high score presumably reflects falsification or random answering. The content of the items is quite diverse, covering some 19 different areas such as hostility, poor physical health, feelings of isolation, and atypical attitudes toward religion and authority. The scale does not assess why a person is endorsing the rare answers. Does the person not understand the directions? Is there a deliberate intent to fake? Does the person's mental status interfere with honest completion of the test? (Scores on the F scale do correlate with scores on the schizophrenia scale.)

The F scale has often been used to detect the presence of faking bad (e.g., Cofer, Chance, & Judson, 1949). In one study, the F scale by itself was the most effective index for identifying fake-bad subjects (Exner, McDowell, Pabst, et al., 1963).

One problem with the F scale is that elevated scores are also associated with blacks, maladjusted individuals, individuals of lower
socioeconomic status, highly individualistic persons, marginal reading proficiency, a different cultural background, and poor cooperation or inattention (Hathaway & McKinley, 1943; Schretlen, 1988). Several attempts have been made to refine the F scale to take these concerns into account; one of these, for example, was the F-K index.

Another major problem with the F scale and similar indices is that different studies report substantially different cutoff scores (the score above which faking is presumably present). Such scores would be expected to differ depending on the setting where the testing is done (e.g., a clinic vs. a university counseling center), the nature of the client sample (e.g., neurotic vs. psychotic), and other aspects.

The K scale. The first three validity scales – ?, L, and F – were part of the original MMPI when it was published. These scales were, however, fairly "obvious" and did not detect more subtle types of faking. A number of different scales were eventually developed that approached the identification of invalid protocols from different directions. One of these was the K scale. Recall from Chapter 3 the discussion of false positives and false negatives. If we use the simple dichotomy of test results vs. clinical diagnosis, we again have our fourfold classification, as shown in Table 16.2.

Table 16–2.

                                Clinical diagnosis
                                Normal             Abnormal
MMPI test results   Normal     True negatives     False negatives
                    Abnormal   False positives    True positives

The intent of the K scale is to minimize both false negatives and false positives, without altering the number of true positives and true negatives.

The 30-item K scale was developed empirically by comparing the responses of psychiatric patients whose MMPI profiles were normal (but whose L score was at least 60) with the item responses of a group of normal subjects. The items incorporated into the K scale also cover a wide range of content areas, but are quite subtle in content, so that their intent is not readily identified. Once the K scale is scored, it is also used as a "correction" factor (i.e., as a suppressor variable) for 5 of the 10 MMPI clinical scales. For example, a person's MMPI schizophrenia scale score is composed of the raw score on that scale plus the raw score on the K scale (see the MMPI Handbook for a detailed discussion).

Not surprisingly, the K scale is psychologically a rather complex scale. Scores on the K scale are related to defensiveness, but moderate elevations in well-educated individuals can reflect "ego strength," that is, self-reliance, the ability to cope with challenges, and good interpersonal skills. High scores on the K scale may reflect a person who has responded false to most items, or tried to fake good, or who lacks self-insight. In a normal individual, a high score may reflect above-average positive characteristics. Low scores on the K scale may reflect a "true" response bias or a fake-bad response set. They may also reflect confusion or suspicion, or a person who is socially awkward and conforming.

Although standard MMPI practice is to routinely incorporate the K correction, there is actually little evidence that K-corrected scores are more valid predictors than uncorrected scores (J. R. Graham, 1990).

The F minus K index. Gough (1947; 1950) determined that individuals who fake on the MMPI and try to create an image of severe psychopathology score considerably higher on the F scale than on the K scale. He therefore suggested, and empirically evaluated, the difference in raw scores between the two scales – F minus K – as a useful index to detect fake-bad profiles. In one of the earliest studies of faking on the MMPI, Gough (1947) asked 11 professionals (psychiatrists and clinical psychologists) to take the MMPI three times: under standard instructions, while attempting to simulate severe psychoneurosis, and while attempting to simulate paranoid schizophrenia. These profiles were then compared with those of actual psychiatric patients with the same diagnoses. The faked neurotic profiles were in fact fairly similar to those
of the neurotic patients, but the F-K index identified 9 of the 11 (82%) as faked. Four judges (including three of the original MMPI authors) were able to correctly identify from 55% to 73% of the 11 profiles as faked. For the psychotic profiles, the F-K index again correctly identified 9 of the 11 (82%) faked profiles, with the four judges achieving success rates from 91% to 100%.

In general, differences greater than +10 on the F-K index are associated with faking bad, and differences less than −10 are associated with faking good, although the specific cutoff score varies from sample to sample.
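Computationally, the index is nothing more than a difference of raw scores compared against a cutoff. A sketch using the ±10 rule of thumb just described (remembering that, as noted, the optimal cutoff varies from sample to sample):

    # The F-K dissimulation index: raw F score minus raw K score.
    def f_minus_k(f_raw, k_raw, cutoff=10):
        index = f_raw - k_raw
        if index > cutoff:
            return index, "suggests faking bad"
        if index < -cutoff:
            return index, "suggests faking good"
        return index, "unremarkable"

    print(f_minus_k(24, 10))   # (14, 'suggests faking bad')
    print(f_minus_k(4, 22))    # (-18, 'suggests faking good')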
MMPI-2. Initial results on the MMPI-2 show that the same procedures to identify faking on the MMPI are also quite applicable to the MMPI-2. For example, J. R. Graham, Watts, and Timbrook (1991) administered the MMPI-2 twice to a sample of college students, first with standard instructions and second with either fake-bad or fake-good instructions. These protocols were compared with those of a psychiatric patient sample with varying diagnoses. Figure 16.1 presents the results.

FIGURE 16–1. Comparison of MMPI-2 profiles for fake-bad instructions vs. psychiatric patients (T scores, rounded to nearest whole number). [Based on J. R. Graham, D. Watts, & R. E. Timbrook (1991). Detecting fake-good and fake-bad MMPI-2 profiles. Journal of Personality Assessment, 57, 264–277. Reprinted with permission of the publisher; Minnesota Multiphasic Personality Inventory-2 (MMPI-2) Profile for Basic Scales. Copyright 1989 the Regents of the University of Minnesota. All rights reserved. "Minnesota Multiphasic Personality Inventory-2" and "MMPI-2" are trademarks owned by the University of Minnesota. Reproduced by permission of the University of Minnesota Press.]

As is typical, normal subjects who tried to fake bad overreported symptoms and difficulties compared to psychiatric patients; their entire MMPI profile was elevated. Most of the MMPI indices designed to detect faking bad worked quite well. The detection of fake-good profiles was more difficult, but the L scale seemed to work relatively well.

As mentioned in Chapter 7, several additional validity scales were developed for the MMPI-2. These include (1) a Back-page Infrequency Scale – this 40-item scale parallels the original F scale in development and was meant to correct the limitation that all of the F-scale items appeared early in the test booklet; (2) a Variable Response Inconsistency Scale, composed of 67 pairs of items with either similar or opposite content – the score reflects the number of pairs of items that are answered inconsistently, presumably reflecting random answering; and (3) a True Response Inconsistency Scale, which consists of 23 pairs of items that are opposite in content. The total score on this last scale is
computed by subtracting the number of pairs of items to which the client gives two false responses from the number of pairs of items to which the client gives two true responses, with either pattern reflecting inconsistency. A constant of 9 is added to the score to remove potentially negative raw scores, so that the final scores can range from 0 to 23. Higher scores reflect a tendency to give true responses indiscriminately, and lower scores indicate a tendency to give false responses indiscriminately. These scales are relatively new, but available studies support their utility (e.g., Arbisi & Ben-Porath, 1998; L. A. R. Stein & J. R. Graham, 1999).
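The scoring rule is easier to follow in code than in prose. One detail must be inferred: for the arithmetic to span 0 to 23 as stated, some pairs must be keyed for double-"true" responding and the remainder for double-"false" responding (a 14/9 split on that reading). The sketch below follows the description above; the split and the responses are assumptions for illustration, not the actual MMPI-2 scoring key:

    # Sketch of the True Response Inconsistency rule as described above:
    # +1 per keyed pair answered true-true, -1 per keyed pair answered
    # false-false, plus a constant of 9. The 14/9 split of the 23 pairs is
    # inferred from the stated 0-23 range, not taken from the actual key.
    def trin(tt_pairs, ff_pairs):
        """tt_pairs: (a, b) answers to pairs keyed for true-true;
           ff_pairs: (a, b) answers to pairs keyed for false-false."""
        plus = sum(1 for a, b in tt_pairs if a and b)
        minus = sum(1 for a, b in ff_pairs if not a and not b)
        return plus - minus + 9

    consistent = trin([(True, False)] * 14, [(True, False)] * 9)   # -> 9
    yea_sayer = trin([(True, True)] * 14, [(True, False)] * 9)     # -> 23
    nay_sayer = trin([(True, False)] * 14, [(False, False)] * 9)   # -> 0
    print(consistent, yea_sayer, nay_sayer)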
Consistency of response. Buechley and Ball (1952) pointed out that the F scale is based on items from the first 300 MMPI items, and not from the latter 266 items. Because boredom is of a progressive nature, a person may well obtain a "normal" F score, yet still answer randomly on the latter set of items. Buechley and Ball therefore developed the Tr (test-retest) scale on the MMPI, which consists of 16 items that are repeated in the original test booklet. These items were originally repeated not to catch individuals who are inconsistent, but to allow the early test-scoring machines to keep track of the scores. In a sample of 137 juvenile delinquents, presumably uncooperative and poorly motivated, the F and Tr scales correlated +.63, and the authors argued that the Tr scale provides a basis for identifying subjects who respond randomly vs. subjects whose responses are valid but consistently bizarre. One advantage of this scale over the F scale is that it is unaffected by psychopathology.

Another consistency-type scale consists of 12 pairs of items that were judged to be psychologically opposite in content (R. L. Greene, 1978). Both this scale and the one just described are of interest when "person reliability" becomes an issue, and they have been useful in identifying MMPI protocols that were the result of random responding, but they have not been used widely (Grigoriadis & Fekken, 1992). This scale, too, is independent of psychopathology.

Rogers, Harris, and Thatcher (1983) conducted a discriminant analysis of MMPI protocols, using the above scales as well as the MMPI validity scales, and found accuracy rates of 90% or better at correctly classifying random MMPI profiles. They also checked the accuracy of "decision rules" – for example, the rule that if the F scale is greater than 80, then the protocol should be called random. Such decision rules were also quite accurate, with half of them correctly identifying 100% of the random protocols and more than 90% of the nonrandom protocols.

Subtle and obvious keys. Another approach to detecting faking is illustrated by the work of Wiener (1948), who developed subtle vs. obvious keys for five of the MMPI scales. Basically, this was done by the author and a colleague on the basis of the manifest content of the items. Items that were relatively easy to detect as indicating emotional disturbance if endorsed were considered "obvious" items. It was assumed that obvious items differentiate best between abnormal and normal groups, whereas subtle items differentiate best between gradations of normal personality. Using this distinction, Wiener rationally divided five MMPI scales into obvious and subtle subscales, and he hypothesized that those who faked bad on the MMPI would endorse more obvious than subtle items, while those who faked good would endorse fewer obvious than subtle items. In general, Wiener (1948) reported that obvious scales were highly correlated with each other and had no correlations with subtle scales. Subtle scales showed low positive correlations with each other.

Although these scales have become quite popular in the MMPI literature, several reviews raise considerable doubt as to the ability of these scales to identify faked profiles (e.g., D. T. R. Berry, Baer, & Harris, 1991; Dubinsky, Gamble, & Rogers, 1985; Schretlen, 1988). In addition, there is evidence that the standard validity scales of L and F appear to be more useful for identifying faked profiles (Timbrook, Graham, Keiller, et al., 1993).

One finding that has been consistently reported is that when subjects fake bad on the MMPI, the "subtle" items tend to be endorsed in the direction opposite to the keyed direction for psychopathology (e.g., Burkhart, Christian, & Gynther, 1978; E. Rosen, 1956). This seems to be due to the face validity of these items, which is opposite to the keyed direction for psychopathology (Dannenbaum & Lanyon, 1993).

Faking good vs. faking bad. A number of studies have concluded that the detection of faking
bad on the MMPI can be fairly accurate, but the detection of faking good is less so. A typical study is that of Austin (1992), who administered the MMPI-2 to college students with instructions that were either standard, fake good, or fake bad. Five indicators of faking were examined – the L, F, and K scales, the Gough F-K index, and the difference between obvious and subtle subscales. The F-K index was the best indicator of fake good, correctly identifying 90% of the fake-good protocols, and the best indicator of fake bad, correctly identifying 100% of the fake-bad protocols. However, this index misclassified more than one third of those who were instructed to respond honestly.

Out of context. If an investigator or clinician is concerned with possible faking, a common procedure is to administer, in addition to the usual tests that do not have a faking index (such as the Beck Depression Inventory), one or more scales to measure the possible presence of such faking; such scales often come from the MMPI.

At least two questions come to mind. First, how valid are such scales when they are not embedded in their original instrument? And second, how well do these scales work compared with each other?

Cassisi and Workman (1992) asked college students to take the complete MMPI-2 and an additional short form that included the 102 items that comprise the L, F, and K scales. They then scored these two sets of scales (those in the standard MMPI and those in this short form). They found correlations of .78 for the L scale, .87 for the F scale, and .83 for the K scale; these coefficients are quite equivalent to the test-retest correlation coefficients reported in the test manual. They then asked another sample of students to take the 102-item form under either standard instructions, fake-good, or fake-bad instructions. Using a T score of 70 (2 SDs above the mean), which is the standard decision rule on the MMPI, they classified the protocols as valid or invalid. Their results appear in Table 16.3. However, the three groups were not significantly different on the K scale, and the F-K index yielded an extremely high false positive rate: 55% of the standard-instructions group were misidentified as either faking good or faking bad (of course, simply because the subjects were instructed to "answer honestly" doesn't mean they did!).

Table 16–3. Percentages of Correct and Incorrect Identifications (Cassisi & Workman, 1992)

                           Identified as:
Actual instructions:       Honest    Fake good    Fake bad
Be honest                  80%       10%          10%
Fake good                  35%       55%          10%
Fake bad                   5%        0%           95%

(Modified from Gough & Bradley, 1996)

Does social desirability equal mental illness? Furnham (1986) argued that the reason tests such as the MMPI that measure mental health are so susceptible to faking (i.e., correlated with measures of social desirability) is that giving socially desirable responses is, in and of itself, an index of mental illness. This is an interesting idea, but it seems to run counter to the findings from other tests.

Random and positional response sets. In both adult and adolescent populations, invalid MMPI profiles, due either to random endorsement or to tendencies to endorse all items as true or as false, are relatively easy to detect (e.g., Archer, Gordon, & Kirchner, 1987; R. L. Greene, 1980; Lachar, 1974).

General conclusions. There is an extensive body of literature on faking on the MMPI; the above studies are just examples. What can we conclude? H. V. Hall and Pritchard (1996) concluded the following:

1. Normal individuals who fake psychosis can be detected fairly readily.
2. Psychotics who feign normality can be detected fairly readily.
3. Psychotics who exaggerate their condition can be detected.
4. MMPI indices are better at discriminating between normal MMPI profiles and the profiles of normal subjects instructed to fake psychopathology than they are at discriminating
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

The Issue of Faking 443

genuine patients and normal subjects instructed other than their relationship to social desirability.
to fake psychopathology. Thus above-average scorers may be characterized
5. MMPI indices are even less able to dis- as considerate, cooperative, conscientious, and
criminate between honest genuine patients and industrious. Low scorers are seen as rebellious,
patients who exaggerate. critical, self-indulgent, and distrustful (Gough &
6. Some MMPI indices are consistently better Bradley, 1996).
than others.
The Communality (Cm) scale. The Cm scale was
7. A universal cutoff score to distinguish genuine
developed in a parallel manner to the MMPI F
from faked MMPIs is not available. Specific cutoff
scale, and was originally called the Infrequency
scores must be established to take into account
scale. The scale consists of 38 items chosen
the population of interest and other aspects.
because they are answered in the keyed direc-
tion by 95% or more of the normative subjects.
Low scores then raise the possibility of random
THE CPI AND FAKING
or “nonnormative” responses. In general, high
Sense of Well-being (Wb) scale. This scale, orig- scores confirm the validity of the protocol, and
inally called the Dissimulation scale, was devel- they also suggest that the subject “fits in,” that
oped through external criterion analyses that the person endorses the values and goals of the
compared the MMPI responses of actual neu- culture; higher scores thus suggest undue conven-
rotics with the responses of normal subjects who tionality and conformity. Low scores not only can
were asked to fake neurosis. In the current CPI reflect invalidity of the protocol, but may point to
(Form 434; Gough & Bradley, 1996) the Wb con- feelings of alienation, not “fitting in”; moderately
tains 38 items; these items originally showed large low scores can thus be indicative of a creative,
differences between the neurotic samples and independent individual.
dissimulation samples, but the items also have
identical rates of endorsement between patients The use of regression equations. The CPI has
and normals. The original scale was used with always been marked by a rather sophisticated
the MMPI, but when the CPI was published, approach to test construction and psychometric
the scoring was reversed and the scale retitled aspects, and the same can be said about its valid-
“Well-being.” ity scales. Lanning (1989) developed three regres-
Very low scores on this scale are indicative sion equations to identify invalid CPI protocols,
of faking bad. The scale however is not sim- which are applied in a “decision tree” format. The
ply a validity scale but also reflects personolog- first equation is applied to a test protocol. If the
ical aspects. Thus, high scores tend to be asso- test is declared “valid” then the second and third
ciated with behaviors that reflect productivity, equations are applied.
self-confidence, getting along well with others – The equations for the CPI-434 are given in
in short, a person who has good health and a Table 16.4. To use these equations, one simply
positive outlook on life. uses the raw scores found in the client’s protocol.
On the first equation, if the total score is 60.6 or
Good impression (Gi) scale. The Gi scale con- greater, then the protocol is considered to reflect
sists of 40 items obtained from a pool of 150 fake good. If not, then the next two equations are
items that were administered to high-school stu- taken together and the following decisions are
dents twice – first under standard instructions, made: (1) if the score on the fake bad equation
and second as if they were applying for a very is equal to or greater than 59.5 and the score on
important job. Those items that showed signifi- the random equation is less than 48.01 then the
cant response shifts under the two sets of instruc- protocol is classified as fake bad; (2) if the score
tions were retained for the scale. Most of the on the fake bad equation is equal to or greater
item content reflects obvious claims to favor- than 59.5 but the score on the random equation
able attributes and virtues, or denial of nega- is equal to or greater than 48.01, then the pro-
tive aspects such as failings. As with other CPI tocol is identified as random; (3) If the protocol
scales, there is also a rather complex layer of does not fit any of the above, then it is considered
personological implications attached to scores, valid.
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

444 Part Five. Challenges to Testing

Table 16–4. Regression Equations to Identify Invalid CPI developed “stylistic” scales.
Protocols For example, Dicken (1963)
Fake good: 41.225 +.273 Do +.198 Em +.538 Gi −.255 Wb −.168 Fx
developed a 32-item social
Fake bad: 86.613− 1.000 Cm −.191 Wb +.203 Ac −.110 Fx desirability scale, and Lovell
Random: 34.096 +.279 Gi +.201 Wb +.225 Py +.157 Fx (cited by Hase & Goldberg,
Note that these equations involve the three validity scales, and in addition the
1967) developed nine scales
Dominance (Do), Empathy (Em), Flexibility (Fx), Achievement via Conformance such as a “deviance” scale
(Ac), and Psychological Mindedness (Py) scales. The initial number in each equation (items with low rates of
represents a weight so that the equation “balances.” (Lanning, 1989).
endorsement and low social
desirability) and a “hyper-
How well does this procedure work? These communality” scale (items with high rates of
equations were applied to sets of protocols pro- endorsement and high social desirability). Such
duced by faking good or faking bad instructions scales have been used in a very limited way in
or by the use of tables of random numbers or research projects.
by computer-generated random responses. The
results are given in Table 16.5.
Note that each equation worked as it ought- SOCIAL DESIRABILITY AND
to – the fake-bad equation identified most of ASSESSMENT ISSUES
the fake-bad protocols, etc. These figures sug- The concept of social desirability as applied to
gest that by using such equations, fake-good pro- personality assessment represented a major con-
tocols will be correctly identified about 64% of troversy in the 1960s, suffered relative obscurity
the time, fake-bad protocols about 84% of the in the 1970s, and seems to have once again
time, and random protocols about 87% of the come to the fore. Social desirability is generally
time. defined as the tendency for subjects to respond
The CPI manual gives considerable data on to personality-test items in a manner that consis-
the application of these equations to a wide vari- tently presents the self in a favorable light.
ety of samples. The incidence of identified faking A. L. Edwards (1957b) indicated that
is extremely small in most samples. The highest personality-inventory items represent desirable
incidence of fake good, for example, was 7.5% in a and undesirable characteristics and that people
large sample of male applicants for the position of have learned to evaluate the behaviors repre-
police officer. Faking good, faking bad, and ran- sented by the test items, whether or not they actu-
dom responding appear to be highly infrequent ally behave in accordance with these evaluations.
phenomena. He proposed a social-desirability continuum on
which individual statements could be placed and
Stylistic scales. In addition to the standard their social-desirability scale value determined.
scales discussed above, other investigators have These values are normative values in that they
represent the average judgment of a large num-
Table 16–5. Outcome of CPI Regression
ber of judges. Very simply, individuals are asked to
Equations rate personality-inventory items on a 9-point rat-
ing scale, in which all anchors are labeled, rang-
Identified by equation as:
ing from 1 = extremely undesirable, through 5 =
Fake bad Fake good Random neutral, to 9 = extremely desirable. The rating is
Instructions: based on how desirable or undesirable the item
Fake bad is when used to describe another person.
Males 84% 0% 2% Once we have a pool of items that have
Females 78% 0% 4% been rated as to their social desirability, we
Fake good
Males 0% 68% 0%
can administer these items to another sample
Females 0% 58% 0% under standard instructions, that is, “Answer true
Random or false as the item applies to you.” For each
Table 24% 0% 66% item, we can then determine the proportion of
Computer 22% 0% 65% individuals who endorsed that item. Edwards
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

The Issue of Faking 445

(1957b) showed that the proportion of endorse- variable to correct scores on other scales, they
ment was highly correlated to the social desir- are in fact personality measures; they are not
ability. Edwards (1957b) then developed social- measuring a response set but a personality trait
desirability scale of 39 MMPI items, each item (Furnham, 1986; McCrae & Costa, 1983a).
keyed for socially desirable response. Thus, indi-
viduals who obtain high scores on the social- Individual differences. J. S. Wiggins (1968)
desirability scale have endorsed a relatively large pointed out that social desirability can be seen
number of socially desirable responses, and those as a property of scale items or as a variable that
with low scores have given few socially desirable reflects individual differences. In fact, scale items
responses. do differ in the desirability of their responses,
Ratings of social desirability are highly reliable, and such response rates (i.e., how many people
whether we compare different groups of judges endorse an item) are correlated to estimates of
rating the same items, or the same group of judges desirability; more “desirable” items are endorsed
rating the same items on two occasions, with typ- by greater numbers of people.
ical coefficients in the .90s. In fact, even when the The concern of social desirability typically
groups of judges are quite different, such as col- focuses more on individual differences – whether
lege students vs. schizophrenic patients, or they social desirability reflects conscious lying, uncon-
come from different cultures (e.g., Japan and the scious defensiveness, or a need for approval,
United States), the degree of correspondence is self-report instruments will be influenced. Indi-
quite high (Iwawaki & Cowen, 1964; J. B. Taylor, viduals who are high on social-desirability will
1959). score higher on measures of adjustment, con-
scientiousness, and similar variables. Scales of
Meaning of social desirability. The concept of such traits will also be correlated, not necessarily
social desirability has been interpreted in two because the traits are, but because social desirabil-
major ways. In the past, social desirability has ity will cause some individuals to score higher on
generally been interpreted as a contaminant; all scales and some to score lower.
that is, individuals scoring high on a social- How can we determine if a subject is
desirability scale are assumed to be faking dissimulating? One way is to check the self-report
good, and therefore their test scores on the against objective external criteria. If the subject
other scales are considered invalid. Thus, self- says true to the item, “I am an excellent swim-
report scales that correlate highly with social- mer” we can observe his or her performance in
desirability scales are considered invalid. Not as the swimming pool. For most items however,
extreme is the view that although social desirabil- such objective evidence is difficult or impossi-
ity is a contaminant, it can be held in check by ble to obtain. A second approach is to assume
statistically computing that portion of variance that high endorsement of socially desirable items
due to it. Zerbe and Paulhus (1987) argue that represents malingering. However, such a result
social-desirability can be considered contamina- could be due to other factors. Maybe the per-
tion only when the construct of social desirability son is well adjusted and conscientious. McCrae
is unrelated to the construct of interest. A good and Costa (1983a) argue that the evidence of sev-
argument can be made that there are a number of eral studies does not support the hypothesis that
constructs, such as adjustment, which on theo- social-desirability scales measure individual dif-
retical grounds we would expect to correlate sub- ferences in social desirability, and that suppress-
stantially with social desirability. ing the variation in test scores does not increase
Furthermore, social desirability is in and of the predictive validity of a test score. They ana-
itself a meaningful personality dimension that lyzed the test scores on 21 personality traits of a
correlates with a variety of behaviors. For exam- sample of adults and used spouse ratings as the
ple, high social desirability individuals, at least criterion. Self-report and spouse ratings corre-
in the United States, are less aggressive, more lated from .25 to .61; however, when these corre-
likely to avoid social interactions, and less argu- lations were corrected for social desirability, for
mentative (Chen, 1994). Thus, whether or not most traits the correlations between test scores
social-desirability scales are useful as a suppressor and spousal ratings decreased; for only two of
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

446 Part Five. Challenges to Testing

the traits was there an increase, but these were The authors conducted a factor analysis and con-
trivial (e.g., from .35 to .36). In fact, McCrae and cluded that the Edwards and the Jackson scales
Costa (1983a) presented evidence that individu- assess a dimension of social desirability that they
als who scored higher on the social-desirability labeled as “a sense of own general capability,”
scale were better adjusted, friendlier, and more and that the Marlowe-Crowne assessed a sep-
open to experience. arate dimension of “interpersonal sensitivity,”
thus suggesting that SD has a “self” component
Scales of social desirability. At least 15 scales and an “another” component.
have been developed to measure social desirabil-
ity (Paulhus, 1984), but there are three major Components of social desirability. The various
ones: the Edwards scale (A. L. Edwards, 1957b), measures of social desirability can be incorpo-
the Marlowe-Crowne scale (Crowne & Marlowe, rated within a two-factor model, with at least
1960), and the Jackson scale (D. N. Jackson, two such models present in the literature. In one,
1984). the two dimensions are attribution vs. denial –
The Edwards social desirability scale evaluates i.e., claiming socially desirable characteristics for
the tendency of subjects to give socially desirable oneself and denying undesirable characteristics
responses. It consists of 39 MMPI items that were (e.g., Jacobson, Kellogg, Cauce, et al., 1977).
unanimously responded to in a socially desirable A second model argues that the two dimen-
fashion by 10 judges, and correlated substantially sions are self-deception vs. impression manage-
with scale total (i.e., the scale is internally consis- ment – in self-deception the respondent actu-
tent). ally believes the positive self-report, whereas in
The Marlowe-Crowne SD scale measures the impression management, there is conscious fak-
tendency to give “culturally approved” responses, ing (e.g., Paulhus, 1986). Thus self-deception is a
and consists of 33 true-false items selected response style that involves an unconscious ten-
from various personality inventories, items that dency to see oneself in a favorable light, whereas
describe culturally approved behaviors with a low impression management is a conscious presenta-
probability of occurrence. These items reflect cul- tion of a false front (Zerbe & Paulhus, 1987). Most
tural approval, avoid the psychopathological con- of the social desirability scales address impression
tent of the MMPI, are keyed for social desirability management, or a combination of the two. Stud-
on the basis of agreement by at least 9 out of 10 ies suggest that the impression management style
judges, and also showed substantial correlation of responding is in fact used by few job applicants
with scale total. and has a negligible effect on criterion validity
Finally, Jackson’s social desirability scale is a (Schmit & Ryan, 1993).
scale from his Personality Research Form (see Ganster, Hennessey, and Luthans (1983)
Chapter 4); it assesses the tendency to describe argued that there are actually three types of
oneself in desirable terms and to present one- social desirability effects: spuriousness, suppres-
self favorably. This scale consists of 20 nonpsy- sion, and moderation. Social desirability may cre-
chopathological items selected from a larger ate spurious or misleading correlations between
pool of items scaled for social desirability. These variables, or suppress (that is, hide) the rela-
items were chosen to avoid substantial content tionship between variables, or moderate (that
homogeneity, as well as extreme endorsement is, interact with) relationships between variables.
probabilities. In spuriousness, social desirability is correlated
with both the predictor and the criterion. Any
Intercorrelation of social-desirability scales. observed correlation between the two results
Presumably, all three scales measure the same from the shared variance in social desirability
phenomenon and so should intercorrelate sub- rather than some other aspect. Statistically, we
stantially. In fact they do not. In one study of 402 can partial out the effects of such social desirabil-
Canadian college students (Holden & Fekken, ity. In suppression, the social desirability masks
1989) the three scales correlated as follows: the true relationship. By statistically controlling
Edwards vs. Marlowe-Crowne .26; Edwards vs. for social desirability, the relationship between
Jackson .71; Jackson vs. Marlowe-Crowne .27. predictor and criterion increases in magnitude.
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

The Issue of Faking 447

In moderation, there is an interaction between absence of such symptoms? They therefore devel-
the independent variable and social desirability. oped their scale from non-MMPI type items.
Here too, this can be assessed statistically. Their initial item pool consisted of 50 items
drawn from a variety of personality instruments.
Can a social-desirability scale be faked? Appar- They retained those items where there was at
ently, the Edwards social-desirability scale itself is least 90% agreement on the part of raters as
influenced by fake good instructions (Furnham to the socially desirable response direction, and
& Henderson, 1982); thus, it may not be a good they also retained those items that discriminated
instrument by which to measure the response bias between low and high scorers. The final scale
of other scales. consisted of 33 items, 18 keyed true and 15
keyed false, for example, “I always try to prac-
Reducing social desirability. Five basic sugges- tice what I preach” (true) and “I like to gossip at
tions to reduce social desirability can be found in times” (false).
the literature:
1. Use a forced-choice format (as in the EPPS), Reliability. Crowne and Marlowe (1960) com-
where the subject must select one of two paired puted the internal consistency reliability (K-R 20)
items equal in social desirability. Unfortunately, to be .88 in a sample of 39 college students, and
this does not eliminate social desirability. For the test-retest (1-month interval) to be .89 on a
one, the items’ social desirability is judged by sample of 31 college students.
mean rating, but may not be equal for a particular
individual. Validity. In the original article (Crowne & Mar-
2. Use items that are neutral in respect to social lowe, 1960) which includes the scale items, the
desirability. Such items are rare and may not Marlowe-Crowne correlated only .35 with the
address important personality dimensions. On Edwards in a sample of 120 college students.
the other hand, items that are so socially “obvi- They also compared both the Marlowe-Crowne
ous” (e.g., “I am basically an honest person”) and the Edwards scale with the standard MMPI
may not be particularly useful from an empir- scales. The results indicated positive correlations
ical validity point of view. between social desirability and the MMPI validity
3. Create a situation where the subject is led to scales, and negative correlations with most of the
believe that whether they are truthful or not can MMPI clinical scales. The pattern of correlations
be detected; this is called the “bogus pipeline” was significantly greater for the Edwards than
technique (E. E. Jones & Sigall, 1971). The eth- for the Marlowe-Crowne scale; the authors inter-
ical implications of this approach are bother- preted these results as indicative that the Edwards
some, even though there is some evidence that scale measured the willingness to endorse
this approach works. neurotic symptoms.
4. Use a lie scale. That is, if you administer a Rather than use this scale as a way of show-
test that does not have built-in validity keys, you ing that other scales are confounded by social
should also administer such scales. desirability (as was done with the Edwards),
5. Ignore the issue. Some writers (e.g., P. Kline, the authors have attempted to show the con-
1986) argue that if a test was properly con- struct validity of their scale by relating scores on
structed and shows adequate validity, the effects this scale to motor skills, attitude changes, self-
of response sets are minimal. esteem, and so on.
Ballard, Crino, and Rubenfeld (1988) inves-
tigated the construct validity of the Marlowe-
The Marlowe-Crowne. Crowne and Marlowe Crowne scale and recommended caution in the
(1960; 1964) criticized the Edwards social- use of this scale. They reported that few of the
desirability scale because of its origin in MMPI scale items were sensitive enough to discriminate
items. They asked that if a subject denied that between high and low scorers, and that many of
“their sleep is fitful and disturbed,” did the the items were no longer keyed in the original
response reflect social desirability or a genuine direction.
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

448 Part Five. Challenges to Testing

The Marlowe-Crowne was developed origi- effect of guessing on objective classroom achieve-
nally as a measure of social desirability response ment examinations. If there is no penalty for
style and has been used extensively for this pur- guessing, the student who fails to guess is penal-
pose. In addition, a number of studies indicated ized. It is therefore common practice to correct
that the Marlowe-Crowne could be also used as for chance, success by some subtractive weighing
a predictor of defensive behavior, such as early of wrong answers from right answers. If how-
termination from psychotherapy (e.g., Strickland ever, the number of true and false keyed items
& Crowne, 1963), or willingness to accept neg- is not equal, then a student would still be penal-
ative test feedback (e.g., Mosher, 1965). R. G. ized, depending on whether they tended to guess
Evans (1982) concluded that the scale could be “true” or “false,” when in doubt and the number
used in a variety of clinical assessment contexts of such items.
and reviewed evidence for several psychotherapy- Cronbach hypothesized, and found experi-
related behaviors. mental support for the notion, that “acquies-
The scale has been conceptualized as having cent” students (who guess true more often than
two dimensions, a responsiveness to social pres- false) should score higher on the true item por-
sure as well as defensiveness, or attribution (the tion than on the false item portion. Such guesses
tendency to attribute social desirable character- on true items will increase their score, but on
istics) vs. denial (the tendency to deny socially false items will decrease their score. Further-
undesirable characteristics). Unfortunately, all of more, poor students should guess more often
the attribution items are keyed true, while all of than good students. The net effect is that the
the denial items are keyed false; thus the the- reliability and validity of the score computed
oretical separation is confounded psychometri- on false keyed items will be greater than that of
cally. Furthermore, Ramanaiah and H. J. Martin scores based on true keyed items, and these two
(1980) argued that the two subscales are measur- total scores on a test will correlate to a negligible
ing the same construct and should not be used degree.
separately. Subsequently, Cronbach (1946) defined acqui-
Attesting to the popularity of this scale is that escence as a response set, a tendency to agree with
a number of short forms have been developed. an item regardless of the content of that item.
Strahan and Gerbasi (1972) produced two 10- Rorer (1965) argued quite convincingly that there
item and one 20-item forms. C. R. Reynolds is no evidence to support the existence of acqui-
(1982) developed three short forms with 11 to escence as defined this second way. All we can say
13 items; others divided the original items into is that individuals have somewhat stable guess-
subsets of attribution items (that reflect the ten- ing habits, that there are reliable individual dif-
dency to attribute socially approved but improb- ferences in such habits, and that in a two-choice
able statements to oneself) and denial items (that situation such as true-false items, individuals do
reflect the tendency to deny socially disapproved not respond 50-50.
but probably true statements about oneself, e.g., A number of scales have been developed to
Paulhus, 1984). See Fischer and Fick (1993) for measure acquiescence (e.g., Bass, 1956; Couch &
which of these scales seems most valid. Keniston, 1960), but most of these scales seem
Goldfried (1964) “cross-validated” the proce- to have substantial correlations with measures
dure used by Crowne and Marlowe, but used dif- of social desirability, even when these scales are
ferent judges, different instructions, a reduced made up of neutral items. A. L. Edwards and Diers
item pool, and different criteria for item reten- (1963) were able to create a 50-item scale that cor-
tion. Not surprisingly, the results were quite related only .08 with scores on the Edwards social
different. desirability scale, but this scale (as with most
others) has not been used widely.
ACQUIESCENCE
Controlling for acquiescence. D. N. Jackson
This term was originally used by Cronbach (1942; (1967) argued that acquiescence should be con-
1946) to denote a tendency to agree more than trolled or eliminated when the test is being
to disagree. Cronbach was concerned about the developed, rather than afterwards, and primarily
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

The Issue of Faking 449

through careful construct validation. In gen- sets and/or styles covers, at best, 16% of the vari-
eral, to the degree that there is an imbalance ance of any socially relevant criterion measure.
in the true-false keying of the items in a scale, Why do scores on a personality scale differ
the scale is susceptible to acquiescent respond- from person to person? The question may seem
ing. Thus the solution, at least on the surface, trite and the answer obvious – people differ from
would appear simple: any test constructor should each other, that is, variance is due to individual
attempt to reduce such an imbalance. Items can differences on the trait being measured. How-
be rewritten, so that “I am happy” becomes ever, there are other sources of variance, such as:
“I am not happy.” There is, however, evidence (1) variance due to the particular strategy used to
that “regularly” worded items are the most reli- construct the scale, (2) variance due to the spe-
able and that negative and polar-opposite items cific item characteristics, and (3) variance due to
(e.g., “I am sad”) may in fact lower the reliabil- response styles (J. S. Wiggins, 1968).
ity, if not the validity, of a test (e.g., Benson & 1. Different strategies. Despite the prolifera-
Hocevar, 1985; Schriesheim, Eisenbach, & Hill, tion of instruments devised by different strate-
1991). gies, there is very little by way of systematic com-
Another way to reduce acquiescence is to make parison among different strategies. The study by
sure that test items are clear, unambiguous, and Hase and Goldberg (1967) is somewhat unique.
anchored in specific behaviors. “Do you like You recall that they compared four strategies of
food?” is an item that almost requires a positive scale construction, all using the common item
answer, but “Do you eat four or more meals per pool of the CPI. The results indicated the four
day?” is much more specific and unambiguous. main strategies to be equivalent.
A number of statistical procedures have also 2. Analyses of the variance due to specific
been suggested as a way to measure and control item characteristics have in large part focused on
acquiescence. Logically, for example, we could social desirability. The procedure here, initiated
derive two scores from a scale, one based on by Edwards (1957b), is to have judges estimate
the keyed items, and one based on the num- the social-desirability scale value of specific items
ber of “true” responses given regardless of the using a Likert-type scale. These ratings are then
keyed direction. D. N. Jackson (1967) argued averaged across judges to provide a mean rat-
that such procedures do not generally control ing for each item. Such ratings are quite reliable
acquiescence. and show a fair amount of consistency for dif-
We could also of course, not use true-false ferent groups of judges. There are, at the same
items, but use other variants such as forced- time, significant individual differences on these
choice or multiple-choice procedures, where the ratings.
various alternatives are matched for endorsement 3. Variance due to stylistic consistencies or
frequency. response styles refers to aspects such as the ten-
One way to assess whether acquiescence plays dency to be critical, extreme, acquiescent, and so
a role on a particular scale, is to rewrite the items on. Here also, the focus is social desirability, but
so that the keyed response is false, and to admin- this time the scale of Social Desirability devel-
ister both the original items and the rewritten oped by Edwards (1957b) is typically used. This
items. This has been done for the MMPI, and 39-item scale was developed from the MMPI item
the correlations of the original scales with their pool and from the Taylor Manifest Anxiety Scale
corresponding reversed forms are similar to their (which is itself comprised of MMPI items). In
test-retest reliabilities, that is, acquiescence does fact, 22 of the 39 items come from the TMAS.
not seem to play a major role (E. Lichtenstein & Because anxiety is in fact the major dimension
Bryan, 1965). found in tests such as the MMPI, it is not surpris-
ing that the Social Desirability scale (composed of
anxiety items) should correlate significantly with
OTHER ISSUES
almost any MMPI scale. It is thus not surpris-
Variance in personality tests. J. S. Wiggins ing that Edwards could predict a person’s MMPI
(1968) placed the matter in perspective by indi- scores by equations derived solely from social-
cating that the variance associated with response desirability indices.
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

450 Part Five. Challenges to Testing

Acquiescence has also been studied. Part of the do Hohn-lachter”), each item requiring a 7-point
problem is that different measures of acquies- Likert-like response scale. The German words
cence tend not to correlate with each other and were randomly chosen from a dictionary, but the
to be factorially distinct (Wiggins, 1968). meaning of the statements was actually nonsen-
sical, and the subjects did not speak German.
Single vs. multiple approaches. Much of They found that the type of statement signifi-
the earlier literature used direct but simple cantly affected the type of response, with causal
approaches to determine the occurrence of fak- statements and particularly causal total state-
ing. Thus, the investigator either had a predeter- ments, eliciting disagreement. Greater education
mined decision rule (e.g., all profiles that have an on the part of the subject also elicited greater
F score of above 80 are to be classified as fake) disagreement with causal statements.
or looked at a distribution of results and deter-
mined which cutoff score would give maximum
Faking with Other Tests
correct identification of profiles as either faked or
not. Recent studies combine indices, sometimes The Edwards Personal Preference Schedule.
from different tests, to assess faking. A represen- The EPPS, which we discussed in Chapter 4, is
tative example of a well-done empirical study is somewhat unique in that its author (the same one
that by Schretlen and Arkowitz (1990), who gave who spearheaded the social-desirability contro-
a series of tests, including the MMPI and the Ben- versy) attempted to control for social desirability
der Gestalt, to two groups of prison inmates who by having each test item consist of two statements,
were instructed to respond as if they were either equated for social desirability. The subjects select
mentally retarded or insane. Their responses were one statement from each pair as self-descriptive.
compared with those of three criterion groups: Corach et al. (1958) argued that such a procedure
psychiatric inpatients, mentally retarded adults, did not eliminate social desirability. They showed
and prison-inmates controls, all taking the tests that there is a “contextual” effect; the social desir-
under standard conditions. Based on discrimi- ability of a single item can differ when that item is
nant analyses, 92% to 95% of the subjects were paired with another, as is done in the EPPS. Fur-
correctly classified as faking or not faking. thermore, they showed that the degree to which
a choice in each pair was made was highly related
Response set vs. item format. The bias of a (r = .88) to an index of social desirability. On
response set can reside in the individual (and the other hand, Kelleher (1958) concluded that
most of the above discussion makes this assump- social desirability played an insignificant role on
tion) or can be a function of the item for- the EPPS!
mat, that is, a specific item may have a cer- Messick (1960) argued that pairing items
tain “pull” for a response bias. Nunnally and equated in social desirability would be effective
Husek (1958) identified two types of statements: only if social desirability were a unitary dimen-
frequency-type statements and causal-type state- sion and showed that there were nine dimensions
ments. For example, the item, “Most individuals or factors underlying social desirability.
who attempt suicide are depressed” is a frequency
statement, whereas “Depression leads to suicide” Other Lie scales. Besides the MMPI L scale, there
is a causal statement. Each of these types of state- are other lie scales that have been developed. For
ments can be further subdivided into four types: example, the Eysenck Personality Questionnaire
total (all depressed individuals attempt suicide), (H. J. Eysenck & S. B. G. Eysenck, 1975) contains a
unspecified (depressed individuals are suicidal), Lie scale, but several studies suggest that the scale
qualified (some depressed individuals are suici- does not adequately discriminate between sub-
dal) and possible (depressed individuals may be jects who answer honestly and those who don’t
suicidal). These authors wanted to know whether (e.g., Dunnett, Koun, & Barber, 1981; Gorman,
the structure of a statement, independent of the 1968); other researchers conclude just the oppo-
meaning, affected subjects’ responses. To answer site, finding that the Lie scale is able to detect
this they created a phony language examination both positive and negative faking (e.g., Furnham
with key words in German (e.g., “Blocksage can & Henderson, 1982).
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

The Issue of Faking 451

The semantic differential. K. Gibbins (1968) severe symptomatology at the beginning) might
suggested that two response styles operated in influence the subjects’ responses. A sample of
the semantic differential: (1) a tendency to use students were then given the BDI twice, once
the neutral response category either frequently under standard instructions and a retest under
or rarely; and (2) a tendency to make judgments one of four conditions: (1) with an instructional
consistently often in one evaluative direction (i.e., set that the BDI measures clinical depression;
to judge targets as either relatively good or rela- (2) with an instructional set that the BDI mea-
tively bad). Gibbins administered a semantic dif- sures “how you feel with everyday difficulties”;
ferential to a sample of British college women (3) using standard sequence of items; (4) using
who were asked to rate 28 “innocuous” targets sequence of items from most to least severe symp-
such as lace curtains, on six bipolar evaluative toms. The results indicated that item sequence
scales, such as good-bad, presented in random did not affect students’ scores, but the clinical-
order. Some support for the hypothesized rela- depression instructions did have an inhibitory
tionship was obtained, although the actual cor- effect, although very small (the BDI mean for
relation coefficients were modest. The question the depression-instruction group was 5.51 com-
here is not one of faking, but of the influence of pared to a mean of 6.67 for the everyday-problem
response style. instruction sample).
W. G. Dahlstrom, Brooks, and Peterson (1990)
Faking on the Beck Depression Inventory (BDI). developed two alternate forms of the BDI – a
The Beck Depression Inventory (see Chapter 7) backwards form, where the alternatives were pre-
was originally a structured interview in which sented in reverse order, and a random form,
the examiner read to the client 21 sets of state- where the alternatives were presented in a ran-
ments. For each set, the client chose the one dom sequence. These forms, together with other
out of 4 or 5 alternatives that most accurately instruments, were administered to college under-
described the client’s feelings at the moment, with graduate women. The random order BDI resulted
the alternatives presented in order of increasing in a significantly higher mean depression score:
disturbance. This interview was presented as an 11 vs. 8 for the original form and 6 for the back-
advance over self-administered instruments or wards form. These results suggest that the tradi-
other interview-based rating methods, and the tional BDI format is susceptible to a “position”
validational data seems to support this claim, response set, where the subject selects the first
even though now the BDI is routinely used as (or last) alternative, rather than carefully think-
a self-report scale rather than as a structured ing about which of the various alternatives best
interview. describes his or her current situation. At the same
The initial validational studies were conducted time, the pattern of correlations between each of
on patients, individuals who had implicitly or the three versions of the BDI and several other
explicitly admitted that they were experienc- instruments, such as the Depression scale of the
ing difficulties. The motivation to dissimulate in MMPI, was highly similar, suggesting that the
some way was probably minimal. In more recent random scale may be a methodological improve-
applications of the BDI, it may be administered ment, yet add little to the criterion validity.
to large samples of subjects in either research or
applied settings where the motivation to dissim- The Millon Clinical Multiaxial Inventory (MCMI).
ulate may be enhanced, for example, in mass test- Another test discussed in Chapter 7 was the
ing of entering college students by student health MCMI, which has become recognized as one of
staff or required participation in an experiment the best instruments to assess personality disor-
as part of an introductory psychology course. ders. A number of investigators have looked at
Kornblith, Greenwald, Michelson, et al. (1984) susceptibility to faking on the MCMI, including
hypothesized that college students might be the use of subtle vs. obvious subscales, with gen-
less willing to endorse depressive items if they erally positive results for most but not all of the
thought the BDI measured severe pathology MCMI subscales (e.g., Bagby, Gillis, & Dickens,
rather than everyday problems; they also hypoth- 1990; VanGorp & R. Meyer, 1986; Wierzbicki &
esized that the order of the items (with more Daleiden, 1993).
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

452 Part Five. Challenges to Testing

The Psychological Screening Inventory (PSI). normal individuals who deliberately attempt to
Lanyon (1993) developed four scales to assess claim very superior mental-health adjustment.
deception on the PSI (see Chapter 7), two to These items showed significant response shifts
assess faking bad (symptom overendorsement from the standard instructions to the fake-good
and erroneous psychiatric stereotype) and two instructions for a sample of 100 college students.
to assess faking good (endorsement of superior Lanyon (1993) showed that for each of these
adjustment and endorsement of excessive virtue). scales, simulated deception (i.e., instructions to
The Symptom Overendorsement scale consists fake) significantly altered the mean scores, that
of 26 items and attempts to assess the extent to the correlations between the fake-good and fake-
which a person in a normal population may indis- bad scales were negative, and the correlations
criminately endorse symptoms of psychopathol- within each domain (e.g., fake good) were posi-
ogy. The chosen items empirically discriminated tive, although in part the magnitude of the cor-
between responses of introductory psychology relation was due to item overlap between scales.
students asked to answer under standard instruc-
tions and under instructions to fake bad. These Integrity tests. In Chapter 14 we discussed
items have face validity (that is, they are obvi- integrity tests, tests used to identify potentially
ous in their nature, such as “hearing voices”). dishonest employees. How susceptible to faking
Malingerers who indiscriminately endorse psy- are such tests? A firm answer cannot be given
chopathology would be expected to score higher because such tests are not easily available to inter-
than actual patients, who endorse only the psy- ested researchers, but the literature does contain
chopathology items that apply to them. This scale some illustrative studies.
is like the F scale of the MMPI, except that MMPI Sackett and Harris (1984) reviewed the litera-
items have a low frequency of endorsement by ture on personnel honesty testing and concluded
normals (less than 10%). that research was needed to determine the faka-
The Erroneous Psychiatric Stereotype scale, bility of such tests.
also made up of 26 items, aims to distinguish Ryan and Sackett (1987b) administered an
between actual psychiatric patients and persons honesty test they developed for their study, under
who claim to be psychiatrically disturbed. These three conditions: respond honestly, fake good,
items empirically discriminated the responses of or respond as if applying for a job. The honesty
college students instructed to fake bad and the test, modeled after “real” honesty tests, contained
responses of a sample of psychiatric inpatients, three subscales: a theft attitude scale, a social
mostly schizophrenics. desirability or lie scale, and an admission scale.
The Endorsement of excessive virtue scale Scores of college students in the fake-good con-
began with 42 PSI items identified by expert dition differed from the scores in the other two
judges to be consistent with a claim of excessive conditions, while the scores of participants in the
virtue. These items reflected claims of being com- “respond honestly” and “respond as job appli-
pletely above reproach in behavior, being thor- cant” differed only on the theft attitude scale.
oughly trustworthy and honest, and so on. The Thus subjects responding as if they were applying
percentage of endorsement for these items by the for a job basically seem to respond truthfully.
expert judges were computed. Then the percent- A related issue is the relationship between
age of endorsement of each item given by the integrity tests and intelligence. The hypothesis
original normative group (1,000 individuals cho- suggests that more intelligent applicants pre-
sen to represent some aspects of Census data) sumably are more likely to understand the pur-
was computed and compared. Items where the pose of the test and therefore attempt to appear
judges’ endorsement and the normative endorse- more honest; but in fact, there is no relation-
ment differed by at least 40% were retained. This ship between indices of intelligence and scores
yielded a 34-item scale – i.e., items judged to rep- on integrity tests (S. H. Werner, Jones, & Steffy,
resent excessive virtue and, in fact, selected by few 1989).
individuals. K. M. May and Loyd (1994) administered
The Endorsement of superior adjustment scale a trustworthiness scale and an attitude-about-
consists of 27 items, and attempts to identify honesty scale to college students in one of two
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

The Issue of Faking 453

conditions: respond as honestly as possible or nonverifiable items. Some studies on the faka-
respond as if applying to graduate school or for bility of these items have found little response
a job. No significant differences were obtained bias (e.g., Cascio, 1975; Mosel & Cozan, 1952),
for the trustworthiness scale, but the results while others have (e.g., Goldstein, 1971; S. P.
indicated that students modified slightly their Klein & Owens, 1965; Schrader & Osburn, 1977;
responses on one of the four subscales of the D. J. Weiss & Dawis, 1960). Some studies have
attitude scale under the condition of applying reported that subjects can improve their scores
to graduate school, but not under the condition when instructed to do so, while other stud-
of applying for a job. Although the authors inter- ies show that, in actual practice, relatively lit-
pret their findings as supportive of the hypothesis tle faking occurs, particularly on biodata items
that students modify their responses based on the that can be verified (T. E. Becker & Colquitt,
purpose of the testing, in fact the support is min- 1992). Mumford and Owens (1987) speculated
imal, and the results are more in line with the that such differences may be due to the item-
conclusion that such tests are fairly robust as far keying strategies used in scoring the particular
as faking good is concerned. biodata questionnaire.
Cunningham, Wong, and Barbee (1994) con- There are two major strategies used – the item-
ducted three experiments to assess “impression keying strategy and the option-keying strategy. In
management” on the Reid Report. In the first the item-keying strategy the alternatives to each
experiment, subjects were encouraged to try and item are scored in such a way that assumes a linear
present themselves as honestly as possible; their relationship between item and criterion. Thus,
responses were compared with those of a sample if the choices are the typical Likert responses –
participating in a research project and a sample e.g., “I am a take-charge person” – strongly agree,
of employment applicants. The instructed sam- agree, not sure, disagree, strongly disagree – the
ple scored higher than the research sample, but item is scored from 5 to 1 if there is a positive
not higher than the job-applicant sample. correlation with the criterion (and 1 to 5 if the
In a second study, subjects were offered a correlation is negative).
monetary reward for obtaining high scores, and With the option-keying strategy, each alterna-
different types of instructions providing spe- tive is analyzed separately, and scored only if that
cific information about the constructs involved alternative significantly correlates with the cri-
in the Reid Report were used. Again, subjects terion. Typically, we might analyze the responses
who were instructed scored higher than con- given by contrasted groups (e.g., successful insur-
trol subjects, but no different from job appli- ance salespeople vs. unsuccessful ones), and score
cants. Finally, in the third study, subjects were those specific alternatives that show a differential
instructed to respond as if “they seriously wanted response endorsement with either unit weights
a job.” After the test, they were overpaid for (e.g., 1) or differential weights (perhaps reflecting
their participation, and note was made as to the percentage of endorsement or the magnitude
whether they returned the overpayment. High of the correlation coefficient).
scorers were significantly more likely to dis- Although both scoring procedures seem to
play integrity by returning the overpayment. The yield comparable results from a validity point
authors concluded that integrity tests possess pre- of view, items scored with the option-keying
dictive validity, despite the possibility of some procedure are less amenable to faking because
response distortion associated with impression the respondent does not know which option
management. yields the maximal score for that item. That is,
Alliger and Dwight (2000) analyzed 14 stud- this procedure could well yield items where a
ies and concluded that overt integrity tests were “strongly agree” response is scored zero, but an
susceptible to fake good and coaching instruc- agree response is scored +1.
tions, while personality-based measures were Kluger, Reilly, and Russell (1991) investigated
more resistant. the fakability of these two procedures, as well as
the fakability under two types of instructions:
Biodata. You will recall that biodata instru- general instructions of applying for a job vs.
ments can contain either or both verifiable and specific instructions of applying for the job of
P1: JZP
0521861810c16 CB1038/Domino 0 521 86181 0 March 4, 2006 14:22

454 Part Five. Challenges to Testing

retail-store manager. Their results indicated that when subjects simulated responding to biodata items as general job applicants, their responses were distorted in a socially desirable direction. Item-keyed scores were susceptible to inflation due to socially desirable responding and to specific job-title instructions; option-keyed scores were not.

These authors suggest that key developers should routinely check whether a combination of both types of keys improves the validity of a specific biodata questionnaire. They also suggested that the effects of faking on validity may depend on the specific job performance being predicted; jobs that require socially desirable behavior (e.g., working with others) may be better predicted by item-keying strategies, while jobs that do not require such behavior may be better predicted by option-keying strategies.

A different approach was taken by J. B. Cohen and Lefkowitz (1974), who administered a biodata questionnaire and the MMPI K scale to a sample of 118 job applicants. They used the K score as a measure of faking and analyzed the biodata results for those scoring above the median on the K scale vs. those scoring below the median. They obtained 14 biodata items that differentiated the two, which when taken as a scale correlated .66 with the K score. They interpreted these biodata items as predictive of fakability. Note, however, that the biodata profile of a supposed dissimulator is that of a person who endorses the following: married man, considers himself "middle-of-the-road" politically, is not bothered when interrupted in his work, tends to ignore the bad habits of others, masters difficult problems by independent reading, has not achieved in sports, has parents who are in business and are satisfied with their home, and called a physician when he was ill. While this portrait may not be an "exciting" one, it suggests middle-class stability and solid adjustment, rather than a scheming, faking approach to the world, and in fact it supports the validity of the K scale as a measure of "ego strength." (For other studies see T. E. Becker & Colquitt, 1992; S. P. Klein & Owens, 1965.)

Faking on the Rorschach. Early studies seemed to suggest that the Rorschach could not be faked. One of the earliest investigators was Fosberg (1938, 1941, 1943), who concluded that the Rorschach could not be faked, but who used flawed statistical methodology (Cronbach, 1949). In the 1950s and 1960s, several studies concluded that there was some degree of susceptibility to faking; for example, Carp and Shavzin (1950) indicated that under instructions to fake good or fake bad, subjects could vary the results.

Seamons, Howell, Carlisle, et al. (1981) administered the Rorschach to prison inmates, many with a diagnosis of schizophrenia, with instructions to fake good or fake bad. A number of response aspects on the Rorschach were influenced by the instructions, but expert judges were able to differentiate correctly between the protocols of those who were asked to appear normal or asked to fake psychosis. As indicated in Chapter 15, at least one recent study found that the Rorschach is indeed subject to malingering, at least in the case of paranoid schizophrenia. A further study by the same authors (M. Kahn, Fox, & Rhode, 1988) indicated that computer analysis of Rorschach protocols was as susceptible to faking as the results of their earlier study; the computer scoring system identified only 10% of the psychotic protocols as psychotic, and still misdiagnosed from 53% to 80% of the faked protocols as psychotic. (For criticism of this study see J. B. Cohen [1990] and M. Kahn, Fox, & Rhode's reply [1990].)

Perry and Kinder (1990) reviewed the literature and concluded that (1) when subjects are instructed to malinger on the Rorschach, they give fewer responses; (2) because analyses of responses on the Rorschach depend on how many responses are given, further analyses must control for this, but they do not, and hence any findings reported in the literature are for now inconclusive; (3) no reliable pattern of responding has been found to be related to malingering across studies; and (4) college students instructed to fake cannot be equated to patients who might be motivated, for a variety of reasons, to fake.

Sentence-completion tests. Of the various projective techniques, sentence-completion tests seem to be among the most valid, yet also the most susceptible to faking. Timmons, Lanyon, Almer, et al. (1993) administered a 136-item sentence-completion test to 51 subjects involved in personal-injury litigation. Based on the
literature, they developed 12 categories of malingering characteristics, such as exaggerated confidence in the doctor, excessive focus on problem severity, and so on. They scored the protocols and carried out a factor analysis. Three factors were identified: one that represented a posture of anger and resentment, a second that represented a disability too severe for future employment, and a third that reflected exaggerated claims of compliance and honesty. Correlations with MMPI items and a cross-validation seemed to show both convergent and discriminant validity for these three scales. Thus, as we saw with the MMPI and CPI, scales designed to assess faking can also have additional personological implications of their own.

Neuropsychological testing. Neuropsychological testing is often performed when there are questions of head trauma, perhaps due to automobile or industrial accidents, and thus the issue of financial compensation may be of importance. Under such circumstances the possibility of malingering or, at the very least, exaggeration of symptoms is a very real one.

The issue is a very complex one, but a general conclusion is that impaired levels of neuropsychological test performance can be simulated by individuals who attempt to fake the symptoms of head trauma, but convincing patterns of impaired test performance may be more difficult to fake (Mittenberg, Azrin, Millsaps, et al., 1993).

In this area of testing also, the typical approach is to have samples of normal subjects who take the specific tests under standard instructions vs. samples who are asked to simulate malingering. Memory disorder is likely to be a prominent head-trauma symptom. Several investigators have focused on susceptibility to malingering on such tests, particularly on the identification of patterns of performance that may be used to identify invalid protocols.

Bernard, Houston, and Napoli (1993), for example, administered a battery of five neuropsychological tests that included the Wechsler Memory Scale-Revised (see Chapter 15) to a sample of college students, with approximately half taking the tests under malingering instructions. Two discriminant equations were calculated and used to identify protocols as either "standard instructions" or "malingering." The overall accuracy rates for the two equations were 88% and 86%. The discriminant function based on the Wechsler Memory Scale-Revised was able to identify correctly all 26 standard-instruction subjects, and misidentified 7 of the 31 malingering subjects as standard. Similar results are reported by Mittenberg, Azrin, Millsaps, et al. (1993).

Intelligence is also often affected in patients who have sustained head trauma, and therefore a comprehensive neuropsychological examination, such as the Halstead-Reitan, usually includes the administration of an intelligence test, typically the WAIS. A number of studies have shown that malingered and valid WAIS profiles can be distinguished on the basis of configural aspects (subtest patterns). In one study (Mittenberg, Theroux-Fichera, Zielinski, et al., 1995), WAIS-R protocols obtained from a sample of nonlitigating head-injured patients (and therefore lacking motivation to fake bad) were compared with those of a sample of normal subjects instructed to malinger head-trauma symptoms. A discriminant function was able to accurately classify 79% of the cases, with 76% true positives and 82% true negatives. A discriminant function based on the difference between only two subtests, Vocabulary and Digit Span, was also successful in 71% of the cases.
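The logic of such a two-subtest screen is easily sketched. In the Python fragment below, the discriminant weights and the cutoff are hypothetical placeholders, not the coefficients reported by Mittenberg et al. (1995), and the assumed direction (Digit Span suppressed more than Vocabulary in malingered protocols) is likewise only illustrative.

```python
# A minimal sketch of classifying a protocol from the Vocabulary-Digit Span
# difference. All numeric values below are hypothetical placeholders, NOT
# the coefficients reported in the study discussed above.

def discriminant_score(vocabulary: float, digit_span: float) -> float:
    """Linear discriminant on two scaled subtest scores (illustrative only)."""
    W_VOCAB, W_DSPAN, CONSTANT = 0.6, -0.8, 0.5   # hypothetical weights
    return W_VOCAB * vocabulary + W_DSPAN * digit_span + CONSTANT

def classify(vocabulary: float, digit_span: float, cutoff: float = 0.0) -> str:
    # A large positive Vocabulary-minus-Digit-Span gap raises the score,
    # under the illustrative assumption that fakers suppress Digit Span.
    if discriminant_score(vocabulary, digit_span) > cutoff:
        return "possibly malingered"
    return "consistent with standard administration"

print(classify(vocabulary=11, digit_span=4))   # large gap -> flagged
print(classify(vocabulary=10, digit_span=9))   # small gap -> not flagged
```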
Binder (1992) concluded that normal subjects, if instructed to simulate brain damage, can do so relatively well, but not completely. On some tests, quantitative and qualitative differences do exist between genuine brain-damaged patients and normal simulating subjects. Binder also concluded that clinicians have poor rates of detection of malingering on traditional neuropsychological measures.

The Halstead-Reitan battery. As discussed in Chapter 15, the Halstead-Reitan battery is used to assess cognitive impairment due to head trauma, often in situations where liability or criminal responsibility may be at issue; thus there may be substantial financial or other incentives to fake, especially to fake bad.

Several studies have shown that simulated and actual impairment can be distinguished by significant differences on several of the subtests, but the results vary somewhat from study to study, in part because of small-sample fluctuations. Mittenberg, Rotholc, Russell, et al. (1996) did study a sizable sample (80 patients and 80 normal
subjects with instructions to malinger) and did find that a discriminant function with 10 variables correctly identified 80% of the protocols, with 84% true positives and 94% true negatives. Other more complex approaches, including identifying test results that are significantly below chance, have been developed with promising results (e.g., Binder & Willis, 1991; Pankratz, Fausti, & Peed, 1975; Prigatano & Amin, 1993).

Other instruments. Studies of the fakability of other instruments do not fare as well. For example, on the Philadelphia Geriatric Center Morale Scale (discussed in Chapter 10), scores correlated .70 with the Edwards Social Desirability Scale in one study (Carstensen & Cone, 1983). As the authors indicated, it is to be expected that a measure of psychological well-being should correlate with social desirability, because it is socially desirable to be satisfied with life and to experience high morale. But such a high correlation calls into question the construct validity of the scale.

TEST ANXIETY

Test anxiety, like anxiety in general, is an unpleasant general emotion, in which the individual feels apprehensive and worried. Test anxiety is a general emotion attached to testing situations, situations which the individual perceives as evaluative. Test anxiety is a major problem for many individuals; it can also be a major problem from a psychometric point of view, because it can lower a student's performance on many tests.

During the 1950s, there was a proliferation of scales designed to assess general anxiety (Sarason, 1960). One of the major ones was the Taylor Manifest Anxiety Scale (J. A. Taylor, 1953). This scale was a somewhat strange product: clinical psychologists were asked to judge which MMPI items reflected a definition of anxiety, and these became the Taylor scale, but the scale was really designed to measure "drive" – an important but generic concept in learning theory. Subsequently, the research focused on specific types of anxiety, such as social anxiety and test anxiety. Seymour Sarason and his colleagues (Sarason, Davidson, Lighthall, et al., 1960) developed the Test Anxiety Scale for Children, which became the first widely used test-anxiety instrument.

I. Sarason (1980) suggested that there are five characteristics of test anxiety:

1. The test situation is seen as difficult and threatening.
2. The person sees himself or herself as ineffective to cope with the test.
3. The person focuses on the undesirable consequences of being personally inadequate.
4. Self-deprecation interferes with possible solutions.
5. The person expects and anticipates failure and loss of regard by others.

Thus, the test-anxious individual performs more poorly under evaluative and stressful situations, such as classroom exams. If the situation is not evaluative or stressful, then there seem to be no differences between high- and low-test-anxious individuals (Sarason, 1980). For a review of the measurement and treatment of test anxiety, see Tryon (1980).

The Test Anxiety Questionnaire. The first measure to be developed was the Test Anxiety Questionnaire, originally a 42-item scale, later revised to 37 items (Sarason & Mandler, 1952). The respondent indicates for each item the degree of discomfort experienced. Reliability for this measure is excellent (e.g., .91 for split-half), and its construct validity is well supported in the literature.

The Test Anxiety Scale. I. Sarason (1958) developed a 37-item Test Anxiety Scale that has also become a popular scale. This scale has undergone several changes, but originally consisted of items rewritten from the Test Anxiety Questionnaire. Typical items are: "I get extremely worried when I take a surprise quiz" and "I wish taking tests did not bother me as much." This scale, in turn, resulted in at least three other instruments: the Worry-Emotionality Questionnaire (Liebert & Morris, 1967), the Inventory of Test Anxiety (Osterhouse, 1972), and, best known, the Spielberger Test Anxiety Inventory (Spielberger, 1980).

Test anxiety is seen by some researchers as a special case of general anxiety. One popular theoretical model is the state-trait model of test
anxiety (Spielberger, 1966), where state anxiety is seen as a transitory phenomenon, a reaction to a particular situation, whereas trait anxiety refers to a relatively stable personality characteristic.

There seems to be general agreement that test anxiety results from the child's reactions to evaluative experiences during the preschool and early school years (Dusek, 1980). A number of studies have shown a negative relationship between test anxiety and performance on achievement tests, although some have not (e.g., Tryon, 1980).

The Test Anxiety Scale for Children (TASC). The TASC is probably the most widely used scale to assess test anxiety in children. The scale consists of 30 items to which the child responds "yes" or "no." The examiner reads the questions out loud, and the TASC can be group-administered. The scale seems to have adequate reliability and validity (e.g., K. T. Hill, 1972; Ruebush, 1963).

One basic issue concerns a child's willingness to admit to anxiety. Sarason and his colleagues developed two scales – a Lie scale composed of 11 questions related to anxiety to which the majority of children answer "yes" (e.g., "Do you ever worry?"), and a Defensiveness scale composed of 27 items that assess a child's willingness to admit to a wide range of feelings (e.g., "Are there some persons you don't like?"). These two scales are highly correlated and are usually given together as one questionnaire. The child's total defensiveness score is the number of items answered "no." Such scores do correlate about −.50 with TASC scores; that is, highly defensive children tend to admit to less anxiety. Therefore, it is suggested that, in a research setting, the scores of those children who score above the 90th percentile on defensiveness not be considered.

The TASC is clearly a multidimensional instrument, with most studies identifying some four factors; yet at the same time, only the total score is typically considered (Dusek, 1980).

Sarason and his colleagues (e.g., K. T. Hill & Sarason, 1966) conducted a 5-year longitudinal study of first and second graders, who were administered the TASC and the two lie-defensiveness scales in alternate years. Test-retest correlations over a 2-year period were modest (primarily in the .30s); over a 4-year period, the coefficients were very low, indicating lack of stability. But is the lack of stability a function of the test or a function of the construct? Does a test-anxious child stay anxious, or does that child learn to cope and conquer the anxiety?

In general, test anxiety seems to be composed, at least theoretically, of two separate aspects: worry and emotionality (e.g., Deffenbacher, 1980). Self-report scales to measure these two components have been developed (e.g., Liebert & Morris, 1967; Osterhouse, 1972).

The Test Anxiety Inventory (TAI). This inventory (Spielberger, 1980) is one of the major measures used to assess test anxiety. The TAI yields a total score, as well as two separate subscores indicating worry and emotionality. The TAI has been used widely with both high-school and college students (Gierl & Rogers, 1996) and has been translated or adapted into many languages, including Italian (Comunian, 1985) and Hungarian (K. Sipos, M. Sipos, & Spielberger, 1985); these versions seem to have adequate reliability and construct validity (see DeVito, 1984, for a review).

TESTWISENESS

Testwiseness or test sophistication refers to a person's ability to use the characteristics and format of a test or test situation to obtain a higher score, independent of the knowledge that the person has. Put more simply, there are individual differences in test-taking skills. Research suggests that testwiseness is not a general trait and is not related to intelligence, but is clue-specific, that is, related to the particular type of clue found in the test items (Diamond & Evans, 1972; T. F. Dunn & Goldstein, 1959). Apparently, even in sophisticated college students, individual differences in testwiseness may be significant; Fagley (1987) reported that in one sample studied, testwiseness accounted for about 16% of the variance in test scores.

Millman, Bishop, and Ebel (1965) analyzed testwiseness into two major categories – those independent of the test and those dependent on it. Independent aspects include strategies to use test time wisely, to avoid careless errors, to make a best guess, and to choose an answer using deductive reasoning. Dependent aspects include
strategies to interpret the intent of the test constructor and the use of cues contained within the test itself.

Experimental test. One of the tests developed to measure testwiseness is the Gibb (1964) Experimental Test of Testwiseness, which is composed of 70 multiple-choice questions. The questions appear to be difficult history items, but can be answered correctly by using cues given within the test question or the test itself. There are seven types of cues, including alliterative association cues, in which a word in the correct answer alternative sounds like a word in the item stem, and length cues, where the correct alternative is longer than the incorrect alternatives. The reliability of the test is adequate though low: .72 for K-R20 and .64 for a test-retest over a 2-week period (Harmon, D. T. Morse, & L. W. Morse, 1996). The validity has not yet been established, although a factor analysis suggested that the test could be characterized as tapping a general proficiency in testwiseness (Harmon, D. T. Morse, & L. W. Morse, 1996).

Eliminating testwiseness. Testwiseness is primarily a reflection of poorly written items and can be substantially if not totally eliminated by:

1. Avoiding use of a name or phrase repeated in both the stem and the correct alternative. For example: "The German psychologist Wundt was affiliated with the university of: (a) Leipzig, Germany (b) Zurich, Switzerland . . . etc."
2. Not using specific determiners (such as "all," "never") in the distractors. For example: "Wundt: (a) never visited other universities, (b) disliked all French psychologists . . . etc."
3. Not using a correct alternative that is longer. For example: "Wundt was: (a) French, (b) a neurosurgeon, (c) considered the father of experimental psychology . . ."
4. Not giving grammatical clues in the stem. For example: "Cattell was a ____ of Wundt. (a) acquaintance, (b) enemy, (c) student . . ."
5. Not using overlapping distractors. For example: "Wundt had at least ____ doctoral students. (a) 5, (b) 10, (c) 15 . . ."

SUMMARY

We have looked at the issue of faking in some detail, in part because there is a rather large body of literature on this topic, and in part because a common concern about tests is whether they can be faked. The evidence suggests that the incidence of faking is rather low, and that well-constructed questionnaires can indeed identify various types of faking with some degree of success. At the same time, our discussion should alert us to the need to get a subject's full cooperation, to act for the benefit of the client, and yet maintain a healthy prudence when we use test results.

SUGGESTED READINGS

Cofer, C. N., Chance, J., & Judson, A. J. (1949). A study of malingering on the Minnesota Multiphasic Personality Inventory. Journal of Psychology, 27, 491–499.

This is an "old" but fairly representative study of the early research in this area. The authors used the ubiquitous students in an introductory psychology course (N = 81) who took the MMPI under standard instructions and under malingering instructions (fake good or fake bad). The article is quite readable, and the statistical operations simple and easy to follow.

Ganellen, R. J. (1994). Attempting to conceal psychological disturbance: MMPI defensive response sets and the Rorschach. Journal of Personality Assessment, 63, 423–437.

The subjects of this study were commercial airline pilots who were required to undergo an independent psychological evaluation after completing a treatment program for alcohol or substance abuse. Because the results of the evaluation would have a bearing on whether their pilots' licenses would be reinstated, there was considerable incentive to fake good. Did they? Did the potential defensive response set affect the MMPI and the Rorschach? Read the article to find out!

Heilbrun, K., Bennett, W. S., White, A. J., & Kelly, J. (1990). An MMPI-based empirical model of malingering and deception. Behavioral Sciences and the Law, 8, 45–53.

This study is also in many ways illustrative of the current research on the MMPI and malingering, and makes a nice contrast with the Cofer et al. (1949) article.

Lees-Haley, P. R., English, L. T., & Glenn, W. J. (1991). A fake bad scale on the MMPI-2 for personal injury claimants. Psychological Reports, 68, 203–210.

The scale developed in this study was "inspired" by Gough's Dissimulation Scale, which, as we have seen, was originally developed on MMPI items, but is part of the CPI rather than the MMPI.
DISCUSSION QUESTIONS

1. How would you define "stylistic" variables?
2. One technique to discourage faking is the use of filler items. How might you go about determining whether such a technique works?
3. Compare and contrast the F scale and the K scale of the MMPI.
4. Do you think that most people fake when they are taking a psychological test such as a personality inventory?
5. Is the nature of faking different for different types of tests? (e.g., consider a career-interest test such as the Strong vs. an intelligence test such as the WAIS)
17 The Role of Computers

AIM This chapter looks at the role of computers in psychological testing. Comput-
ers have been used as scoring machines, as test administrators, and recently as test
interpreters. We look at the issue of computer-based test interpretation (CBTI) and
questions about the validity of such interpretations. We consider ethical and legal
issues, as well as a variety of other concerns. The role of computers in testing is a very
“hot” topic currently, with new materials coming out frequently. Entire issues of pro-
fessional journals are devoted to this topic (e.g., December 1985 issue of the Journal
of Consulting and Clinical Psychology and the School Psychology Review, 1984, 13,
[No. 4]), and there are entire journals that focus on computers and psychology (e.g.,
Computers in Human Behavior). At the same time, this is a relatively new field, and
many issues have not yet been explored in depth.

HISTORICAL PERSPECTIVE

Computers have been involved in some phase of psychological testing ever since the mid 1950s, when computer centers were established on university campuses. One of the first uses of campus computers was to score tests that previously had been hand scored or mechanically scored.

Although a wide variety of tests were involved in this phase, from achievement tests to personality inventories, much of the early impetus focused on the MMPI.

A second area of computer and testing interface involved the direct administration of the test by computer. Many tests are now available for computerized administration, but here too, much of the pioneer work involved the MMPI.

A third area involved the increased use of computers to provide not just test scores but actual test interpretation; this function, too, was enhanced by the work of a number of psychologists at the University of Minnesota, who had been developing and refining the MMPI. One such event was the publication of an atlas of MMPI profiles; a clinician testing a patient with the MMPI could code the resulting profile and look up in the atlas similarly coded profiles, together with a clinical description of those clients. This was the beginning of the actuarial method of test prediction – interpretation of the meaning of a test score based upon empirical relationships rather than clinical subjective judgment.

The first computer-assisted psychological testing program was used in the early 1960s at the Mayo Clinic in Minnesota. The clinic had a large number of patients but a small psychology staff. There was a strong need for a rapid and efficient screening procedure to determine the nature and extent of psychiatric problems and symptoms. A solution was the administration of the MMPI using IBM cards, which could be read by a computer as answer sheets. A program was written that scored 14 MMPI scales, changed the raw scores to standard scores, and printed a series of descriptive statements, depending on the patient's scores. These statements were the kind


of elementary statements that a clinician might make (e.g., a score of 70 or above on scale X suggests a great degree of anxiety). This program was simple and ignored configural patterns, such as specific combinations of scales, but it worked relatively well (R. D. Fowler, 1967, 1985). Similar work on the MMPI was also going on at the Institute of Living in Hartford, Connecticut (Glueck & Reznikoff, 1965).

A number of other computer-based test interpretation systems were then developed, each one more sophisticated than the earlier one, using data based on configural patterns, special scales, and indices, and taking into account whether the client is a psychiatric patient or a college student, and other aspects. Basically the procedure remains the same: there is a library of descriptive statements that are matched to particular scores or combinations of scores. By the 1980s, there were commercially available interpretive programs for a variety of personality inventories – the MMPI, the CPI, the Millon CMI, the Cattell 16 PF, and others – but for practically all of them, no information was available as to how the computer program had been developed and whether the resulting descriptions were valid.

The entire process of testing – from initial instructions, presentation of test items, and scoring and transformation of raw scores, to the final narrative interpretation – can now be automated, and it probably increasingly will be in the future. Currently, a number of other areas of interfacing are being explored, limited only by our creativity and ingenuity. One is the assessment of reaction time; another is the use of test-item formats not found on the traditional paper-and-pencil (p-p) test, such as interactive graphics, movement, and speech – all possible on a computer screen.

Computer functions. Recapping briefly, the three major roles that computers play in psychological testing are: (1) they can quickly and efficiently score tests and transform the raw scores into various forms, such as percentiles, T scores, etc.; (2) they can present and administer the test directly; and (3) they can be programmed to generate interpretations of the test results. There are two additional functions that also have an impact on psychological testing: (1) computers can store large amounts of information, such as normative data or "banks" of items; and (2) they can control other pieces of equipment, such as optical scanners, videodisc presentations, and so on.

The Mayo Clinic MMPI program. The computer program initially developed at the Mayo Clinic consisted of a scoring portion and an interpretation portion. The scoring simply tabulated the responses according to the keyed responses for each scale. The interpretive portion consisted of simple narrative statements associated with specific scores. For example, a raw score between 11 and 14 on the Paranoia scale was associated with the statement "sensitive, alive to opinions of others." A score of 15 to 19 elicited the statement "touchy, overly responsive to opinions of others. Inclined to blame others for own difficulties." A score of 20 or more elicited the statement "Resentful and suspicious of others, perhaps to the point of fixed false beliefs" (Kleinmuntz, 1977).

In addition to some 49 such descriptive statements, there were also a number of statements related to specific profile patterns. For example, if the person was older than 69 and had a hypomania score of less than 15, instead of printing the statement associated with a low hypomania score, the computer would print "low energy and motivation typical for age" (Hartman, 1986; Rome, et al., 1962).
creativity and ingenuity. One is the assessment of
reaction time, another the use of test-item for- In general, there is relatively little controversy
mats not found on the traditional paper-and- about the scoring of tests by computer. For many
pencil (p-p) test, such as interactive graphics, tests, the scoring can be done locally, sometimes
movement, and speech – all possible on a com- using “generic” answer sheets that, for example,
puter screen. can be read by a Scantron machine. For other
tests, the scoring sheets must be returned, either
Computer functions. Recapping briefly, three electronically or by mail, to the original publisher,
major roles that computers play in psycholog- or to another scoring service. Clearly, at least in
ical testing are: (1) They can quickly and effi- objective-test items, the scoring reliability is per-
ciently score tests and transform the raw scores fect because computers do not get tired or inat-
into various forms, such as percentiles, T scores, tentive, as humans do.
etc.; (2) They can present and administer the test
directly; (3) They can be programmed to generate Optical scanners. Many tests can be scored using
interpretations of the test results. optical scanners that detect and track pencil
Two additional functions that also have marks. These marks are translated into data, the
an impact on psychological testing, namely: data is stored, and it can then be analyzed with the
appropriate software programs. The scanner can also be combined directly with a microcomputer and a printer, so that the sequence from scoring to output is more fully automated. We can also entirely eliminate the paper-and-pencil format and place the entire test and its scoring key directly into the computer.

Configural scoring. In this book, we have considered test items as having a keyed response, with each response to an item independent of the others, yet added to responses on other items to obtain a total score on a scale. Other possibilities exist, and one of these is configural scoring, with the simplest case based on two items. Consider the following true-false items:

1. I like vanilla ice cream better than chocolate ice cream.
2. Most people are dishonest if they can get away with it.

Considered as a unit, there are four possible response patterns to these items: both items can be answered as true, both as false, the first one as true and the second as false, and vice versa. Let's assume that empirically we find that 80% of student leaders give a false-false response, whereas only 20% of student nonleaders do so. We now have the possibility of scoring the two items as a pattern indicative of some real-life behavior (assuming we rule out chance and other possibilities). Essentially, that is what configural or pattern scoring is all about, and the computer can be extremely useful in carrying out such calculations, which might be prohibitive if done by hand. For a good example of a review of a computer scoring program (in this case for the Luria-Nebraska Neuropsychological Battery), see Hampton (1986).
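To make the arithmetic of pattern scoring concrete, here is a minimal sketch in Python. The two items and the false-false "leader" pattern are the ones just described; the scale itself, and the idea of awarding one point for the pattern, are illustrative assumptions rather than any published scoring key. The point of the sketch is that the unit of scoring is the joint pattern, not either item taken alone.

```python
# A minimal sketch of configural (pattern) scoring for the two true-false
# items discussed above. The false-false pattern is the one that, in the
# hypothetical example, separated student leaders (80%) from nonleaders (20%).

from itertools import product

def leadership_pattern_score(item1: bool, item2: bool) -> int:
    """Score 1 if the joint response *pattern* matches the leader pattern."""
    return 1 if (item1, item2) == (False, False) else 0

# Enumerate all four possible response patterns for the two-item unit.
for pattern in product([True, False], repeat=2):
    print(pattern, "->", leadership_pattern_score(*pattern))
```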
speech or the use of limbs, to select from five
distinct responses per item. For an example of an
COMPUTER ADMINISTRATION OF TESTS early automated system to administer the WAIS,
see Elwood and Griffin, (1972).
Type of service. Generally speaking there are
One major advantage of the computer in test
three types of service currently available:
administration is that the computer demands an
1. Local processing. Here a test is administered active response to the instructions and to the
locally, often with software that can score and practice problems. It is therefore not possible
interpret the test. Such software is typically to begin testing until there is evidence that the
obtained from the test publisher, and it may have respondent has mastered the instructions and
a built-in limit. For example, after 25 adminis- understands what is to be done.
Paper-and-Pencil (p-p) vs. Computer Format (cf)

When we take a p-p test and administer it through a computer, we haven't simply changed the medium; we have potentially changed the test. Thus, we cannot assume that existing norms and evidence of the reliability and validity of the p-p format apply automatically to the cf; the equivalence of the two needs to be established.

There are at least four potential ways in which a cf test can differ from the original p-p version:

1. The method of presentation: for example, a graph may be more readable and clearer on the original p-p version than on the cf version.
2. The requirements of the task: for example, on a p-p version the examinee can go back and review earlier items and answers; on the cf version this may not be possible.
3. The method of responding: for example, the p-p version of the ACL requires the examinee to check those items that are self-descriptive, and items that are not are left blank. The computer version cannot accept a blank response, and each item must be responded to as self-descriptive or not.
4. The method of interpretation: for example, norms available for a p-p version may not be fully applicable to the cf version (Noonan & Sarvela, 1991).

Evidence for equivalence. Equivalence is present if: (1) the rank order of scores on the two versions closely approximates each other; and (2) the means, variability, and shape of the score distributions are approximately the same, or have been made the same statistically. Empirically, when p-p tests are changed to cf, two types of evidence are usually investigated to show that the two forms are equivalent: (1) the means under the two conditions should be the same; and (2) the intercorrelations, such as among subtests, should also be the same. Occasionally, a third type of evidence is presented – the correlational pattern between the cf version and relevant criteria should be of the same magnitude as for the p-p version. Sometimes it is somewhat difficult to determine whether there is equivalence or not, because we would expect some random fluctuation in means. For example, an arithmetic reasoning test was given to a large number of Navy recruits, either p-p or cf. The mean score on the cf was significantly lower by 1.04 raw-score points – but is such a difference of practical import?

Equivalence can also be defined more broadly as equivalence in construct validity – specifically, equivalence in factor structure and equivalence of factor loadings. W. C. King and Miles (1995) assessed this type of equivalence in four noncognitive instruments and found that administration mode had no effect on equivalence.

Theoretical aspects. From a theoretical point of view, there are two major approaches to assessing the equivalence of p-p and cf versions (F. R. Wilson, Genco, & Yager, 1985):

1. Classical test theory. Here we wish to show that the two versions yield equal means and variances, and that the pattern of correlations with other measures, as in criterion validity, is essentially identical. We may also wish to assess the convergent-discriminant validity of the two forms.
2. Generalizability theory. Here we ask whether obtained results are generalizable across different conditions; that is, if we wish to diagnose an individual as having a particular psychiatric syndrome, it should not make any difference whether the MMPI was administered as a cf or as a p-p instrument, and, what is more important, we can identify various sources of potential variability. (For an example that combines the two approaches, see F. R. Wilson, Genco, & Yager, 1985.)
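The classical-test-theory checks just listed amount to a few lines of code. The sketch below uses invented scores for ten examinees who took both versions; a full equivalence study would, of course, also compare variances formally, examine subtest intercorrelations, and look at criterion correlations.

```python
# Illustrative classical-test-theory equivalence checks for a (made-up)
# sample tested under both formats: compare means and spreads, and compute
# the cross-format Pearson correlation (rank-order agreement).
from statistics import mean, stdev

pp = [24, 31, 18, 27, 35, 22, 29, 26, 33, 20]   # paper-and-pencil scores
cf = [25, 30, 17, 29, 36, 21, 28, 27, 32, 22]   # computer-format scores

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

print(f"mean p-p = {mean(pp):.2f}, mean cf = {mean(cf):.2f}")
print(f"SD   p-p = {stdev(pp):.2f}, SD   cf = {stdev(cf):.2f}")
print(f"cross-format r = {pearson_r(pp, cf):.3f}")
```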
Issues of test construction. Although it would seem an easy matter to change a p-p test into a cf, there are a number of issues, many related to the question of equivalence between the two forms (see Noonan & Sarvela, 1991).

On a paper-and-pencil test, the instructions and sample items are typically presented at the beginning, and the examinee can go back to them at any time. On a computer, these options have to be explicitly programmed.

On a paper-and-pencil format, there are a number of ways to indicate a response, but most involve a simple mark on an answer sheet. On the computer, responses can be indicated by pressing
a key, clicking a mouse, touching the screen, and so on. After the response is entered, the computer can be programmed to present the next item or to present an option such as "are you sure? yes or no." This would require a double keystroke response for every item, which may safeguard against "accidental" responses, but it prolongs testing time and may, for some examinees, become an aversive feature.

When students change an answer on a p-p test, research seems to suggest that the change is more likely to involve going from an incorrect answer to a correct one rather than the other way around (L. T. Benjamin, 1984). Should a computerized test then be programmed so that the respondent is allowed to return to an earlier item? In adaptive testing, this would create difficulties.

On a p-p test, examinees often will answer the easy items, skip the more difficult items, and then go back to answer the more difficult ones. To program these options into a computer is quite difficult, and requires a certain amount of interaction between examinee and machine that may alter the nature of the test.

At present, computer screens cannot comfortably present a full page of writing. If the test item is a matching exercise, this could present difficulties.

The nature of the test, whether it is a diagnostic test or a mastery test, interacts with the nature of the computer to create some challenges. Diagnostic or placement tests, when implemented on a computer system, ordinarily use branching (see the sketch below and the discussion that follows), where the testing sequence is directly related to the responses given on earlier items. In a mastery test, such as an achievement or "classroom" type test, items are usually presented sequentially, although branching could also take place.
When to discontinue testing also can differ between diagnostic and mastery testing. In a diagnostic test, failure at one level might move the examinee to less difficult material. In a mastery test, testing is stopped once the examinee has answered a minimal number of items either correctly or incorrectly (you will recall our discussion of basal and ceiling levels with cognitive tests in Chapter 5).

Finally, in diagnostic tests we are typically interested in obtaining a full picture of the individual – not just a diagnosis that this person is psychotic, but a detailed analysis of the strengths and difficulties the person experiences. In mastery testing, we are usually more interested in global scores, such as knowing that this examinee scored at the 87th percentile on mastery of elementary algebra.

The type of test item also interacts with the computer format. For now, computers are quite adept at using selected-response items (such as multiple choice and T-F), but are much less able to cope with constructed-response items such as essays.

The size of the item pool is also something to consider. At least theoretically, computers could handle any number of items, but some types of tests, such as diagnostic tests, or some procedures, such as branching, would generally require larger item pools.

For the present, most efforts that have become applied involve true-false and/or multiple-choice items, and we have not as yet taken advantage of the computer's capability of presenting visual and changing stimuli. No doubt in the future we will have tests of mechanical aptitude, for example, which might present moving machines that need to be altered in some way, automobile engines that need to be fixed, and so on. Sound and animation may well be an integral part of future cf tests.

General findings. Most studies of the comparability or equivalence of p-p tests with their cf versions indicate a great degree of comparability with such tests as the MMPI, CPI, Strong-Campbell Interest Inventory, and others, in a wide variety of samples such as gifted children, children and adolescents, college students, and geriatric patients (e.g., Carr, Wilson, Ghosh, et al., 1982; Finger & Ones, 1999; L. Katz & Dalby, 1981; Scissons, 1976; Simola & Holden, 1992; Vansickle & Kapes, 1993). Despite this, the results should not be taken as a blanket assumption of equivalence. Schuldberg (1988) pointed out that most previous research used statistical designs that were not very sensitive to possible differences between p-p and cf administrations; however, using such sensitive statistical analyses, he found few differences, or differences that were small in magnitude, on the two versions of the MMPI. There may well be differences in various subgroups of individuals, and the computerized procedure may well interact with personality,
attitudinal, or psychopathological variables. For example, Finegan and Allen (1994) compared a variety of cf questionnaires, such as attitude and personality scales, with their p-p versions. Basically, the two modes of presentation were highly correlated in their Canadian college-student subjects, but there was some evidence that computer administration slightly increased socially desirable responding among subjects with little computer experience. Waring, Farthing, and Kidder-Ashley (1999) found that impulsive students answered multiple-choice questions more quickly and less accurately than reflective students.

In fact, a handful of studies that have incorporated other variables into their equivalence designs suggest that cf versions result in less social desirability, more extreme responses, more self-disclosure, more interest, and greater awareness of thoughts and feelings. Other studies report greater social desirability and a reduction in candor (King & Miles, 1995). Clearly, this is not a closed issue.

In addition, much of the research reports equivalence results in terms of correlation coefficients between the p-p version scores and the cf version scores. Keep in mind that a high correlation coefficient can be obtained even if the scores on one format are consistently higher or lower than on the other format.
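A small computation with invented scores makes the point: if the cf format adds a constant three points for every examinee, the cross-format correlation is a perfect 1.0, even though the two versions clearly could not share the same norms.

```python
from statistics import mean, stdev

pp = [10, 14, 19, 23, 30]                  # invented p-p scores
cf = [s + 3 for s in pp]                   # cf adds a constant 3 points
r = (sum((a - mean(pp)) * (b - mean(cf)) for a, b in zip(pp, cf))
     / ((len(pp) - 1) * stdev(pp) * stdev(cf)))
print(r)                                   # 1.0: perfect correlation,
print(mean(cf) - mean(pp))                 # yet the means differ by 3 points
```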
At the same time, if a cf version of a test is not equivalent to the p-p version, that does not mean that the cf version is useless and should be thrown out. It may well be that the cf version is more reliable and valid than the p-p format, but this needs to be determined empirically.

Some studies have not found equivalence. For example, in a study of the Raven's Progressive Matrices, the scores on the cf version were significantly different from those on the p-p version, and the authors concluded that separate norms were needed (Watts, Baddeley, & Williams, 1982).

One potential difference between p-p and cf is that more items are endorsed on the cf version. One hypothesized explanation is that individuals "open up" more to a computer than to a live examiner. Another possibility is that the cf version requires a response to be given to every item, whereas on the p-p version items can be left blank or skipped.

Equivalence of reliability. Few studies have addressed this issue. D. L. Elwood and Griffin (1972) compared the p-p and cf versions of the WAIS through a test-retest design, and found the results to be virtually identical.

Equivalence of speeded tests. Many clerical and perceptual tests are highly speeded. They consist of items that are very easy, such as crossing out all the letter e's in a page of writing, but the score reflects the ability to perform the task rapidly; if sufficient time were given, all examinees would obtain perfect scores. With such tests, there are a number of potential differences between the p-p version and the cf version. In general, marking a bubble on an answer sheet takes considerably longer than pressing a computer key. Several studies do indicate that with speed tests, subjects are much faster on the cf version, that the reliability of scores obtained on the cf presentation is as high as that of the p-p version, and that despite the differences in mean performance, the correlation between the two forms (reflecting the rank order of individuals) can be quite high (e.g., Greaud & Green, 1986; Lansman, Donaldson, Hunt, et al., 1982).

One possible advantage of the cf is that a fixed number of items can be administered, and the time elapsed can be tracked simultaneously with the number of items attempted. This is much more difficult to do with p-p tests. Whether such information can be used in a predictive sense (e.g., to predict job performance) remains to be seen.

Even with tests that are not highly speeded, equivalence can be problematic if speed is involved. For example, Van de Vijver and Harsveld (1994) analyzed the equivalence of the General Aptitude Test Battery as administered to Dutch applicants to a military academy. You will recall from Chapter 14 that the GATB is a general intelligence speed test that uses a multiple-choice format and is composed of seven subtests, such as vocabulary and form matching. Differences between the p-p and cf versions were "small though noticeable," with the cf subtests producing faster and more inaccurate responses, and with simple clerical tasks affected more than complex tasks.

A meta-analysis of the equivalence of p-p versus cf versions of cognitive ability tests indicated a significant difference between speeded tests and
power tests. For power tests, the typical correlation was .97, whereas for speeded tests it was .72. Keep in mind that the definitions of speeded and power tests, such as the ones we gave in Chapter 1, are "idealized" types; most tests are somewhere in between. Many power tests, such as the SAT and the GRE, do in fact have significant time limits.

Equivalence of checklists. Checklists may also present problems. On a p-p version, the respondent checks those items that are pertinent and leaves the rest blank. The cf forces the respondent to read all items and respond to each one as "yes" or "no." On the ACL, for example, people tend to select many more adjectives as self-descriptive when they respond to the cf version, and most of the additional adjectives checked are favorable ones (B. F. Green, 1991).

Equivalence of computers. One issue that has not been fully explored is the equivalence of different computers, specifically the degree of resolution of the screen (i.e., how clear the image is on the monitor) and the range of color available. The same test may not be the same when presented on two different computer screens, and this needs to be assessed.

Report of results. When a test is administered through a computer, it is relatively easy to program the computer to calculate the raw scores and to change such scores into standard scores or derived scores, such as T scores and percentiles. The computer is an ideal instrument to carry out such calculations based on extensive normative data that can be programmed.
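Such transformations are simple to program. The sketch below converts a raw score into a T score (mean 50, SD 10) and a percentile rank, assuming a hypothetical scale whose normative mean is 30 with an SD of 6, and assuming normally distributed scores.

```python
# Converting a raw score to derived scores, using invented normative values
# (a normative mean of 30 and SD of 6 for a hypothetical scale).
from statistics import NormalDist

NORM_MEAN, NORM_SD = 30.0, 6.0

def t_score(raw: float) -> float:
    """T score: the z score rescaled to mean 50, SD 10."""
    z = (raw - NORM_MEAN) / NORM_SD
    return 50 + 10 * z

def percentile(raw: float) -> float:
    """Percentile rank, under the normality assumption stated above."""
    z = (raw - NORM_MEAN) / NORM_SD
    return 100 * NormalDist().cdf(z)

print(f"raw 39 -> T = {t_score(39):.0f}, percentile = {percentile(39):.0f}")
```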
Do examinees like the computer? Initially, there was a great deal of concern that computers might be seen as impersonal and cold, and that examinees would respond negatively to the experience and perhaps alter their test answers. In fact, the research indicates that in most cases, if not all, examinees like computer testing. These studies have obtained similar results with a variety of clients and a variety of tests (e.g., M. J. Burke, Normand, & Raju, 1987; Klingler, Johnson, & Williams, 1976; N. C. Moore, Summer, & Bloor, 1984; F. L. Schmidt, Urry, & Gugel, 1978).

There is, however, the phenomenon of "computer anxiety," an emotional fear or discomfort that some individuals experience when interacting with computers. Studies suggest that female college undergraduates are more anxious than males, that older people are more anxious, and that computer anxiety is inversely related to computer experience (Chua, Chen, & Wong, 1999).

Preference for p-p or cf. In most studies where this is assessed, subjects typically prefer the cf over the p-p format. For example, in one study of the MMPI, 77 of 99 subjects preferred the cf (D. M. White, Clements, & Fowler, 1986). Most of these studies, however, use American college students, who are fairly sophisticated computer users.

Test preparation. As part of test administration by computer, we might consider the possible role of the computer as a mentor. H. V. Knight, Acosta, and Anderson (1988) compared the use of microcomputers vs. textbook materials as aids in the preparation of high-school students to take the ACT (a college entrance examination along the lines of the SAT). They found that students in the computer group scored significantly higher on the composite and math subtests but not on the science subtest.

Issue of disclosure. One of the earliest studies on the effect of computer administration on subjects' responses involved patients at an alcohol-abuse treatment clinic, who tended to report greater amounts of alcohol consumption in a computer-administered interview than in face-to-face psychiatric interviews (Lucas, Mullin, Luna, et al., 1977). The results of this study were widely interpreted as an indication that subjects will be more frank and disclosing, especially about personal and negative matters, in the more impersonal computer-administered situation. Subsequent studies supported this interpretation, but some did not (e.g., Carr, Ghosh, & Ancil, 1983; T. Reich, et al., 1975; Skinner & Allen, 1983).

Issues of test security. Although computer testing would seem to provide greater security than the traditional format of p-p tests, there are a number of issues that need to be faced.
Achievement tests, such as course examinations that are available on computer, illustrate some of the issues. Should such exams be made available at any time for the student to take? What kind of monitoring, if any, is needed to prevent cheating or misuse of the system? Should examinees be allowed to "preview" a test, or only some examples? Should the examinee be given feedback at the end of the test and, if so, what kind? For example, should the feedback simply include the score with some normative interpretation ("your score of 52 equals a B+"), or more detailed feedback as to which items were missed, and so on?

COMPUTER-BASED TEST INTERPRETATIONS (CBTI)

We now come to the third area of computer usage, where the test results are interpreted by the computer. Obviously, it is not the computer that does the interpretation; the computer is programmed to produce such an interpretation based on the test results. Such CBTIs are commercially available for a significant number of tests, especially personality inventories; for example, for the MMPI there are at least 14 such computer-based scoring systems (Eyde, Kowal, & Fishburne, 1991).

Types of CBTIs. CBTI systems can be characterized along two dimensions – the amount of information they provide and the method used to develop the program (Moreland, 1991). In terms of amount of information, reports can vary substantially, from a simple presentation of scores, to graphs, information on what the scores mean, and interpretive and integrated descriptions of the client. For our purposes, we can distinguish at least three levels:

1. Descriptive reports. In these reports, each of the test scales is interpreted individually, without reference to the other scales, and the comments made are directly reflective of the empirical data and are usually fairly brief. For example, a high score on a depression scale might be interpreted as, "Mr. Jones reports that he is very depressed." These reports, though limited, can be useful when a multivariate instrument (such as the MMPI) has many scales or when there are many protocols to be processed.
2. Screening reports. These are somewhat more "complicated," in that the computer narrative reflects scale relationships rather than the individual scales taken one at a time. Thus, a descriptive statement tied to a particular high score on a depression scale might appear only if a second scale is also high; if the second scale score is low, then a different statement might be printed out.
3. Consultative reports. These are the most complex and look very much like the report a clinician might write on a client who has been assessed. The intent of these reports is to provide a detailed analysis of the test data, using professional language, typically written for professional colleagues. This type of report is analogous to having a consultation with an expert on that test, a consultation that would ordinarily not be available to the typical practitioner. Such reports are produced by scoring services for multivariate instruments such as the MMPI and the CPI. To see what CBTI reports look like for a wide variety of tests, see Krug (1987).

Actuarial vs. clinical. In terms of development, there are basically two methods at present by which such computer software is developed: one is the actuarial method, and the other is the clinical method.

1. The actuarial method. Historically, this approach was given great impetus by P. E. Meehl, a Minnesota psychologist who in various publications (e.g., Meehl, 1954, 1956) argued that test results could be automated into standard descriptions by the use of a computer. In 1956, Meehl called for a good "cookbook" for test interpretation, just as a cook "creates" a dish by following a set of instructions. The idea was simply to determine empirically the relationship between test scores and nontest criteria, and to use these actuarial data to make predictions for a specific client. Thus, if on test X of depression we find that 85% of clients who score above 70 attempt suicide within a year, we can now predict that Mr. Jones, who scored 78, will most likely attempt suicide. There are in fact a few examples of this approach in the literature, most notably on the Personality Inventory for Children (Lachar & Gdowski, 1979).

As one might expect, there have been several attempts to produce such actuarial cookbooks for
the MMPI (e.g., Drake & Oetting, 1959; Gilberstadt & Duker, 1965; P. A. Marks & Seeman, 1963), but these systems have failed when applied to samples other than those originally used. One of the major problems is that a large number of MMPI profiles cannot be classified following the complicated rules that such MMPI cookbooks require for actuarial interpretation.

2. The clinical method. The clinical method involves one (or more) experts who certainly use the results of the research literature and actuarial data, but who basically combine these data with their own acuity and clinical skills to generate a library of potential interpretations associated with different scale scores and combinations. A variety of such clinical computer-based interpretive systems are now available for a wide variety of tests, but primarily for the MMPI.

One argument present in the literature (e.g., Matarazzo, 1986) is that the same test scores, as on the MMPI, can be interpreted differently given different demographic factors. For example, a high paranoia-scale score would be interpreted one way if obtained by a 45-year-old patient with schizophrenia, but rather differently if obtained by a 21-year-old college student. The clinician attempts to take into account the unique characteristics of the client, whereas the computer can only be programmed in a nomothetic fashion (for example, if the client is older than 28, interpret this way; if younger, interpret that way).

Unfortunately, the literature suggests that in attempting to take into account the individual uniqueness of the client, the validity of clinical reports generated by clinicians usually decreases, primarily because clinicians are inconsistent in their judgment strategies, whereas computers are consistent. Virtually all of the 100 or so studies that compare the clinical vs. the actuarial methods show that the actuarial methods equal or exceed the accuracy of the clinical methods. The greater the extent to which clinicians rely on empirically established methods of data interpretation and collection, the greater their overall accuracy. The computer has tremendous potential to play a major role in increasing the accuracy of psychological evaluations and predictions, as the programming becomes more sophisticated in evaluating an individual's test scores.

Automated vs. actuarial. Note that automated or computerized test interpretation is not equivalent to actuarial. Meehl (1954) specified that an actuarial method must be prespecified; that is, it must follow a set procedure and be based on empirically established relations. A computerized test interpretation may simply model or mimic subjective clinical judgment, or it may actually be based on actuarial relationships. As a matter of fact, very few computerized-report programs are based solely on actuarial relationships. Why is that? In large part because, at present, actuarial rules on tests such as the MMPI would tend to classify few protocols at best.

Study the clinician. Because we basically want the computer to do what an expert clinician does, we can start by studying expert clinicians. One example is the work of Kleinmuntz (1969), who asked several experienced MMPI interpreters to sort 126 MMPI profile sheets that had been previously identified as belonging to adjusted or maladjusted college students. The expert who achieved the highest hit rate was then asked to think aloud as he sorted MMPI profiles. His verbalizations were tape recorded, and specific decision rules were then devised. For example, the first rule used by the clinician was that if there were 4 or more MMPI scales higher than 70, the profile would be called maladjusted. A second rule was that if all the clinical scales were below 60, if the Hypomania scale (scale 9) was below 80, and if Mt (a maladjustment scale) was below a raw score of 10, then the profile would be called adjusted (see Kleinmuntz, 1969, for the set of 16 rules used). These rules were then programmed into a computer to be used to score MMPI profiles. In a follow-up study, the computer program did as well as the best MMPI clinician, and better than the average clinician.
their judgment strategies, whereas computers are
did as well as the best MMPI clinician, and better
consistent. Virtually all the 100 or so studies that
than the average clinician.
compare the clinical vs. the actuarial methods,
show that the actuarial methods are equal to
or exceed the accuracy of the clinical methods. Reliability of CBTIs. In one sense, the reliabil-
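Rules of this kind translate directly into program logic. Here is a minimal sketch of the first two rules just described, assuming a dictionary of T scores keyed by scale name; the scale list is illustrative, the treatment of scale 9 in rule 2 is one reading of the rule, and the actual program applied a sequence of 16 such rules.

# Illustrative MMPI clinical scale names; `scores` maps scale name -> T score.
CLINICAL_SCALES = ["Hs", "D", "Hy", "Pd", "Mf", "Pa", "Pt", "Sc", "Ma", "Si"]

def classify(scores, mt_raw):
    """Apply the first two rules; None means later rules must decide."""
    # Rule 1: four or more scales above T = 70 -> maladjusted.
    if sum(scores[s] > 70 for s in CLINICAL_SCALES) >= 4:
        return "maladjusted"
    # Rule 2: clinical scales below 60 (Hypomania, scale 9, need only be
    # below 80) and Mt below a raw score of 10 -> adjusted.
    low = all(scores[s] < 60 for s in CLINICAL_SCALES if s != "Ma")
    if low and scores["Ma"] < 80 and mt_raw < 10:
        return "adjusted"
    return None   # undecided; the remaining rules would be consulted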
Reliability of CBTIs. In one sense, the reliability of CBTIs is perfect. If the same responses, say for the MMPI, are entered and then reentered, the computer will produce the same exact report. This is not the case with a clinician, who may interpret the same MMPI profile somewhat differently on different occasions. On the other hand, if the same protocol is submitted to several scoring services, the result may not be the same.
However, at present, aside from the MMPI, such choice is not available for most tests.

The validity of CBTIs. As with the Rorschach, and indeed with almost every test, we cannot simply ask, "Are CBTIs valid?" We must ask, "Which CBTI, and for what purpose?" Not many studies are available to answer these questions, but hopefully they will be in the future. Ideally, to evaluate a CBTI, we need to evaluate the library of item statements and the interpretive rules by which statements are selected. Very few studies have been done where experts are asked to look at the interpretive system and evaluate the accuracy of the decision rules that lead to the interpretive statements (e.g., Labeck, Johnson, & Harris, 1983). Such information is typically not available, and thus most studies evaluate the resulting CBTI, either in toto or broken down into sections.

A number of studies can be subsumed under the label of "consumer satisfaction," in that clinicians are asked to rate the accuracy of such reports, using a wide variety of experimental approaches, some of which control possible confounding effects and some of which do not (see Labeck, Johnson, & Harris, 1983; J. T. Webb, Miller, & Fowler, 1970, for examples). Moreland (1987) listed 11 such studies (although many others are available, especially on the MMPI), with accuracy ratings ranging from a low of 32% to a high of 91% and a median of 78.5%. Although this consumer-satisfaction approach seems rather straightforward, it has a number of methodological problems (see D. K. Snyder, Widiger, & Hoover, 1990). A general limitation of these studies is that high ratings of satisfaction do not prove validity.

In a second category of studies, clinicians are asked to complete a symptom checklist, a Q sort, or some other means of capturing their evaluation, based on a CBTI. These judgments are then compared to those made by clinicians familiar with the client, or to judgments based on some other criterion, such as an interview. Moreland (1987) listed five such studies, with mean correlations between sets of ratings (based on the CBTI vs. based on knowledge of the client) between .22 and .60, and with the median of 18 such correlation coefficients at .33. Such studies are now relatively rare (see Moreland, 1991, for a review).

Such findings essentially show that there is moderate agreement between clinicians – the clinicians who are asked to judge the CBTI and the clinician who originally supplied the test interpretation. Such agreement may well be based on clinical "lore"; although the clinicians may agree with each other, both could be incorrect.

A third set of studies consists of external criterion studies, in which the accuracy of the CBTI is matched against some criterion. As Moreland (1987) points out, these studies have all sorts of problems, and the results are mixed. However, at least for the MMPI, the results look fairly good, and the validity coefficients of computerized reports are fairly comparable with those found in the literature for conventional MMPI reports. Yet we must consider Faust and Ziskin (1989), who concluded "that there is little scientific basis for determining the validity of CBTIs." (For a critical response, see Brodsky, 1989.)

Perceived validity. It is quite possible that simply seeing that a report is computer generated may make that CBTI more acceptable as correct and objective.

Honaker, Hector, and Harrell (1986) asked psychology graduate students and practicing psychologists to rate the accuracy of MMPI interpretive reports that were labeled as generated by either a computer or a licensed clinician. There was no difference in accuracy ratings between the two types of reports. In addition, some reports contained a purposefully inaccurate statement, and these reports were rated as less valid. Experienced clinicians tended to perceive reports labeled computer generated as less useful and less comprehensive than the same reports labeled clinician generated. Thus, this study failed to support the claim that computer-generated interpretations are assigned more credibility than is warranted.

L. W. Andrews and Gutkin (1991) asked school personnel to rate identical reports that differed only in terms of authorship – computer vs. school psychologist. The results indicated that authorship had virtually no effect on how the reports were perceived as to overall quality, credibility, and diagnostic interpretation.

Usefulness of CBTIs. CBTIs can provide unique quantitative assistance to the test user (Roid,
1985). In particular, CBTI reports can assist in answering four questions:

1. How should the results of the test be tempered in light of the client's background, demographic variables such as age, base rates of possible diagnostic conditions, and so on? In particular, CBTI programs can help the clinician become more aware of moderator variables that may affect test interpretation for a particular client.
2. How would experts on the test analyze and interpret the patterns of observed scores, indices, etc.?
3. What research findings have implications for this particular client?
4. How usual or unusual are the client's scores, patterns of diagnostic signs, and so on?

Potential limitations. Two major potential limitations of CBTIs have been identified. First, there is the potential for misuse. CBTIs are widely available, may be used by individuals who do not have the appropriate training, or may be uncritically accepted by professionals. Second, there is the issue of excessive generality, or what is sometimes called "the Aunt Fanny" report – a report that is so general that its contents could apply to anyone's Aunt Fanny. That is, CBTIs are basically built on modal descriptions or generic types.

The Barnum effect. One of the major problems in trying to study the validity of CBTIs is the Barnum effect. The "Barnum effect" (named after the P. T. Barnum circus, which had a little something for everyone) refers to the phenomenon of accepting a personality interpretation that is comprised of vague statements that have a high base rate in the population (Meehl, 1956). For a review of this topic, see C. R. Snyder, Shenkel, and Lowery (1977).

Here is a typical Barnum-effect study. College students were administered the 16 PF and were subsequently provided with a CBTI. One sample of students received a CBTI and a "Barnum" interpretation, while another sample also received a third interpretation labeled a "prosecuting attorney" interpretation. The CBTI was based on a library of some 1,500 possible statements associated with different scores on the 16 PF. The Barnum version, identical in appearance to the CBTI, was made up of generic statements that tend to be true of most people (such as "He has a great deal of potential that is not well utilized"). The prosecuting-attorney reports were essentially identical to the Barnum reports but contained a lot of clinical jargon (for example, the word "potential" became "libidinal energy used in maintaining defenses"). Each student had to select which interpretation was most accurate and which they liked best. In general, the Barnum interpretation was perceived as the most accurate and the prosecuting-attorney one as the least accurate. There was no difference in liking between the actual interpretation and the Barnum one, but the prosecuting-attorney version was generally disliked (O'Dell, 1972).

It is not surprising that the Barnum report was seen as high in accuracy; after all, it consisted of statements that are true of almost everyone. The point is that to judge the accuracy of a CBTI we need to assess that accuracy against a benchmark – in this case, that obtained through a Barnum report.

A number of studies use real and bogus CBTIs that are rated as to perceived accuracy. The difference between average ratings for the real CBTI and for the bogus CBTIs, expressed as a "percentage correct" increment, is used as the perceived discriminant accuracy of the CBTI (Guastello & Rieke, 1990). For example, in one study, undergraduates completed a personality questionnaire and received a CBTI. Approximately half of the students received a CBTI based on their test scores, while half received a bogus CBTI, and both were asked to rate their relative accuracy. The real CBTIs were rated as 74.5% accurate while the bogus CBTIs were rated as 57.9% accurate. The authors concluded that the amount of rated accuracy associated with the Barnum effect was 66.2% (S. J. Guastello, D. D. Guastello, & Craft, 1989; for other studies, see Baillargeon & Danis, 1984; S. J. Guastello & Rieke, 1990).

Obviously, how favorable the statements in the CBTI are affects the ratings, and this must be controlled experimentally and/or statistically. There are also many ways to make up a bogus report. The bogus report may be a randomly selected real report that belongs to someone else, or it may be made up partially or totally. One can use an average test profile as the source for the bogus report, or one can create a bogus report simply by reversing extreme T scores – for example, a T
score of 30 (two SDs below the mean) would be fed to the computer as a T score of 70 (two SDs above the mean).
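In code, this reversal is just a reflection of each T score around the mean of 50, as in the following sketch; the cutoff defining an "extreme" score is a hypothetical choice.

def make_bogus_profile(t_scores, cutoff=10):
    """Reflect extreme T scores around the mean of 50 (e.g., 30 -> 70).

    Scores within `cutoff` points of the mean are left untouched.
    """
    return {scale: 100 - t if abs(t - 50) > cutoff else t
            for scale, t in t_scores.items()}

print(make_bogus_profile({"D": 30, "Pa": 55}))   # {'D': 70, 'Pa': 55}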
S. J. Guastello and Rieke (1990) had college students take the 16 PF and subsequently rate their CBTI, as well as a bogus one based on the average 16 PF class profile. The real CBTIs were rated as 76.3% accurate overall, while the bogus report was rated as 71.7% – thus the real reports were judged better than the bogus report by a very small margin.

CBTI vs. the clinician. As yet, there is little evidence on how CBTIs compare to the traditional report prepared by the individual clinician. In one study that compared the two, CBTIs were judged substantially superior in writing style, accuracy, and completeness (Klingler, Miller, Johnson, et al., 1977). In another study, both types of reports were judged to be mediocre in accuracy and usefulness (Rubenzer, 1992). However, most of the studies in this area involve only one case and are anecdotal in nature (see Moreland, 1991). It can also be argued that a clinician's report is not a good criterion against which to judge the validity of the CBTI.

Test X vs. Test Y. It would seem natural to compare the relative accuracy and usefulness of different tests and their CBTIs, but almost no studies exist on this. C. J. Green (1982) compared two CBTIs on the MMPI with the CBTI on the Millon Clinical Multiaxial Inventory (MCMI). Twenty-three clinicians, such as psychiatrists, social workers, and clinical psychologists, rated the CBTIs for 100 of their patients, on the basis of how adequate the information was, how accurate, and how useful (i.e., how well organized) the report was. One of the two MMPI report systems was judged as substantially less satisfactory on almost all aspects, and the MCMI was judged more accurate on interpersonal attitudes, personality traits and behaviors, self-images, and styles of coping.

SOME SPECIFIC TESTS

Criticisms of MMPI CBTIs. Butcher (1978) listed five criticisms of CBTIs for the MMPI:

1. Computerized reports are not an adequate substitute for clinical judgment, and a computerized report does not do away with the need for a trained clinician.
2. Reports may be incorrect because they fail to take into account relevant demographic aspects such as age or educational background.
3. Once something is computerized, it often becomes "respected" even though it was not validated to begin with.
4. Computerized interpretation systems may not be kept current.
5. The validity of computer-generated narratives has not been adequately established.

Accuracy of MMPI CBTIs. As we have seen above, this is a complex issue. One aspect is that the CBTI may not necessarily identify the major themes pertinent to a specific client. A good illustration is a study cited by Schoenfeldt (1989) in which an MMPI protocol was submitted to four different companies that provide computer-based interpretations. The results were compared with a blind analysis of the MMPI profile by a clinician and with information obtained during a clinical interview. The four computer narratives failed to identify several important problems of the client, which included depression, alcohol abuse, and assaultive behavior.

The Millon Clinical Multiaxial Inventory (MCMI). Eight clinical psychologists rated the accuracy of CBTIs for the MCMI. For each client, a clinician received two reports: one generated by the computer and one generated randomly, and these reports were rated as accurate or inaccurate for each of the seven sections of the report. Percentage of accuracy ranged from 62% to 78% (median of 73%) for the real report and from 32% to 64% (median of 39%) for the random report, with two sections of the report judged as no more accurate than those of the randomly generated report. Thus, although the results supported the judged accuracy of the MCMI, this study points out the need for a "control" group to determine whether the obtained results yield findings over and above those generated by random procedures (Moreland & Onstad, 1987; Piersma, 1987).

The Marital Satisfaction Inventory (MSI). For each of the MSI scales, low, moderate, and high
ranges of scores have been determined empirically. These ranges differ from scale to scale, so a clinician would need to know that, for example, a T score of 65 on one scale might well represent a moderate score, while on another scale it would represent a high score. The test author (D. K. Snyder, 1981) developed a CBTI composed of more than 300 interpretive paragraphs based on individual scale elevations as well as configural patterns, both within and across spouses. (For an interesting case study and an example of a CBTI, see D. K. Snyder, Lachar, & Wills, 1988.)

Neuropsychological Testing and Computers

Report writing. A neuropsychological report represents a great deal of professional time, and so it would be most advantageous if such reports could be computer generated. Also, there is research that indicates greater diagnostic accuracy for mechanical/statistical methods of data combination vs. human clinical judgment, and so the diagnostic accuracy of a well-designed computer program should exceed that of clinicians.

However, at present this potential has not been realized. Adams and Heaton (1985; 1987) point out that there are several obstacles to the development of computer programs to interpret neuropsychological test results:

1. There is little uniformity in clinical neuropsychology. The specific tests, the types of interpretations and recommendations, as well as the types of diagnostic decisions made are so varied that computer programs have great difficulty incorporating all these parameters.
2. We have an incomplete state of knowledge about brain-behavior relationships. Most research studies are based on samples that have well-defined conditions, but in most clinical settings the patients who need a diagnostic workup have conditions that are not so clearly defined.
3. In normal adults, performance on neuropsychological tests is correlated positively with education and negatively with age, indicating that such demographic variables are important in neuropsychological test interpretation. We do not yet have adequate norms to incorporate such corrective data in our computerized decision systems. There are other factors that complicate interpretation of neuropsychological test results, such as low intelligence. A clinician can consider the unique combination of variables present in a client but, for now, a computer cannot.
4. Finally, the use of such computer programs can be considered premature and possibly unethical at this time.

Test scoring. Many neuropsychological tests are quite simple to score, in that they require a simple correct/not-correct decision, or time taken to perform a certain task, and so on. Thus computer scoring of neuropsychological tests would not necessarily provide a major advantage.

For some tests, there are scoring corrections to be made depending upon the subject's age, education, or other variables – for example, add 3 points if the subject is older than 65. Such scoring corrections are usually easy to calculate, but for many tests such corrections are simply not available as part of the normative data. One major exception is the Luria-Nebraska Neuropsychological Battery, for which computer-scoring programs are available and quite useful.

One area where computerized scoring could be very helpful is in comparing scores across tests. In neuropsychological assessment, a battery like the Halstead-Reitan or the Luria-Nebraska is often given in conjunction with other measures, such as the WAIS and the MMPI, as well as measures of receptive and expressive language skills, memory, and so on. Each of these tests has often been normed on different samples, and scores are often not directly comparable from instrument to instrument. What is needed are computer programs that can analyze such disparate data and yield comparable scores.
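A minimal sketch of what such a program might do follows: each test's raw score is standardized against that test's own norms and then expressed on a common T-score metric (mean 50, SD 10), so that results from different instruments can be compared. The test names and normative values here are hypothetical.

# Hypothetical normative data: each instrument's own mean and SD.
NORMS = {
    "halstead_category_errors": {"mean": 45.0, "sd": 14.0},
    "wais_vocabulary_raw":      {"mean": 42.0, "sd": 6.5},
}

def to_t_score(test, raw):
    """Convert a raw score to the common T metric via that test's norms."""
    norm = NORMS[test]
    z = (raw - norm["mean"]) / norm["sd"]   # standardize within the test
    return 50 + 10 * z                      # express on T metric (M=50, SD=10)

# Scores from different instruments are now on the same scale:
print(round(to_t_score("wais_vocabulary_raw", 49), 1))        # 60.8
print(round(to_t_score("halstead_category_errors", 31), 1))   # 40.0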
Test administration. Neuropsychological patients as a group have difficulties with aspects such as following directions and persisting with a task. Also, qualitative observations of the testing are very important in the scoring and test interpretation. Thus, the kind of automation that is possible with a test such as the MMPI may not be feasible with neuropsychological testing, more because of
the nature of such assessment than because of any test aspects themselves.

Test interpretation. In general, most attempts to develop a computer program that would take the raw data and interpret the results focus on three questions: (1) Is there brain damage present? (2) Where is the damage localized? and (3) Is the damage acute or chronic? The results have been quite limited and disappointing. One study found that three computerized scoring systems did reasonably well in identifying the presence of brain damage, but were less accurate in localizing such damage (Adams, Kvale, & Keegan, 1984). Such systems, however, were not compared to the results obtained by clinicians. One study that compared a computer program with the accuracy of skilled psychologists found that the two experts were more accurate than the computer program in predicting the presence of brain lesions and their laterality, but not their chronicity (R. K. Heaton, Grant, Anthony, et al., 1981).

The Halstead-Reitan. A number of computer programs have been developed to do an actuarial analysis of the Halstead-Reitan Neuropsychological Test Battery. The early investigations of these systems produced inconsistent results, and it was questioned whether these programs could be as accurate as experienced clinicians in determining the presence or absence of brain damage and related aspects. Adams, Kvale, and Keegan (1984) studied a sample of 63 brain-damaged patients on which there was reliable diagnostic information – what was wrong with the patients was well documented. Each patient had taken a number of tests such as the WAIS, the MMPI, and many of the Halstead-Reitan subtests. The computer program essentially involved a statistical comparison of an index of WAIS subtest scores that presumably are resistant to mental deterioration (see Chapter 5) vs. an index on the Halstead-Reitan presumably showing degree of impairment. Basically, the computer program correctly identified 57% of those with left-hemisphere brain damage, 81% with right-hemisphere damage, and 90% with diffuse damage. The authors concluded that all three computer programs they tested were inadequate as comprehensive neuropsychological report mechanisms.

ADAPTIVE TESTING AND COMPUTERS

The advent of computers into psychological testing has resulted in a number of advances, perhaps most visibly in the area of adaptive testing. In a typical test, all examinees are administered the same set of items. In Chapter 2 we discussed the bandwidth-fidelity dilemma: do we measure something with a high degree of precision, but applicable to only some of the clients, or do we measure in a broad perspective, applicable to most, but with little precision? Adaptive testing solves that dilemma. In an adaptive test, different sets of items are administered to different individuals depending upon the individual's status on the trait being measured (Meijer & Nering, 1999; D. J. Weiss, 1985). For example, suppose we have a 100-item multiple-choice vocabulary test, with items ranging in difficulty from the word "cat" to the word "gribble." Rather than starting everybody with the word cat, if we are testing a college student, we might begin with a word of middle difficulty such as "rickets." If the person gets that item correct, the next item to be presented would be of slightly higher difficulty. If the person answers incorrectly, then the next item to be presented would be of lower difficulty. Thus the computer calculates, for each answered item, whether the answer is right or wrong, and what the next appropriate item would be. As the person progresses through the test, the calculations become much more complex because, in effect, the computer must calculate a "running average" plus many other more technical aspects. The person might then experience a slight delay before the next appropriate item is presented. However, the computer is programmed to calculate the two possible outcomes – the answer is right or the answer is wrong – while the person is answering the item. When the answer is given, the computer selects the pertinent alternative and testing proceeds smoothly.
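A bare-bones sketch of this up-one/down-one logic is given below; operational systems instead use item-response theory to select the most informative next item, and all names here are illustrative.

def adaptive_test(items, answer_fn, n_items=20):
    """Administer n_items from a list sorted easiest to hardest.

    answer_fn(item) returns True if the examinee answers correctly.
    """
    i = len(items) // 2              # begin at middle difficulty
    record = []
    for _ in range(n_items):
        correct = answer_fn(items[i])
        record.append((items[i], correct))
        if correct:
            i = min(i + 1, len(items) - 1)   # step up to a harder item
        else:
            i = max(i - 1, 0)                # step down to an easier item
    return record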
There are many synonyms in the literature for adaptive testing, including tailored, programmed, sequential, and individualized testing. Adaptive testing is of course efficient and can save considerable testing time. Adaptive tests are also more precise and, at least potentially, therefore more reliable and valid.

Adaptive testing actually began with the Binet-Simon tests, and its principles were incorporated
into the Stanford-Binet. Thus, the administrator begins the Stanford-Binet not at the same question for everybody, but at a point appropriate for the particular child; with a bright 8-year-old, we might start testing at the 9th- or 10th-year level. As items are administered they are also scored, so the examiner can determine whether to continue, stop, or return to easier items. Currently, most adaptive administration strategies are based on item-response theory and have been primarily developed within the areas of achievement and ability testing. Item-response theory, however, requires scales of items that are unidimensional. Most personality tests such as the MMPI and the CPI have multidimensional scales.

Branching. A variation of adaptive testing with particular potential in personality assessment is that of branching. Let's say we wish to assess an individual on more than 25 different areas of functioning, and we have some 50 items per area. In the area of depression, for example, we might have items about hopelessness, lack of libido, loss of weight, suicidal ideation, and so on. If the responses to the first four or five items clearly indicate that for this client suicide is not a concern, we could move or branch to the next area, where more items might be appropriately administered.
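The branching idea reduces to a simple gate, as in this sketch; the gate size and the notion of a "flagged" response are hypothetical choices.

def administer_area(gate_items, remaining_items, answer_fn, gate_size=5):
    """Give the area's first few items; skip the rest if nothing is flagged.

    answer_fn(item) returns True when a response signals possible concern.
    """
    flagged = sum(answer_fn(item) for item in gate_items[:gate_size])
    if flagged == 0:
        return []                         # no concern: branch to the next area
    return [(item, answer_fn(item)) for item in remaining_items]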
The countdown method. Butcher, Keller, and Bacon (1985) have proposed the countdown method as a way to adaptively administer the MMPI. Essentially, this method terminates item administration on a particular scale once sufficient information for that scale is obtained. Suppose, for example, that we have a 50-item scale and that an elevated score (for example, a T score of 70 or above) is obtained once the client answers 20 items in the keyed direction. At that point, administration of further items could be terminated. Similarly, if the client answers 31 items in a nonkeyed direction, the test administration could also be stopped, because scale elevation could not be obtained with the remaining 19 items. Because degree of elevation might be important, we could continue testing in the first case but not in the second.
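In code, the countdown logic for one such scale is a pair of running tallies, as in this minimal sketch using the thresholds from the example above (function and parameter names are illustrative):

def countdown_scale(items, answer_fn, keyed_needed=20, total=50):
    """Stop administering items once the scale's outcome is already decided.

    answer_fn(item) returns True for a response in the keyed direction.
    """
    keyed = nonkeyed = 0
    for item in items[:total]:
        if answer_fn(item):
            keyed += 1
        else:
            nonkeyed += 1
        if keyed >= keyed_needed:              # elevation is now certain
            return "elevated"
        if nonkeyed > total - keyed_needed:    # 31 nonkeyed: elevation impossible
            return "not elevated"
    return "not elevated"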
In one study (Roper, Ben-Porath, & Butcher, 1995), 571 college students were randomly assigned to one of three groups and administered the MMPI-2 twice, 1 week apart. One group took the standard booklet form both times; one group took the standard form and was retested with a computerized adaptive form; and the third group took a standard computerized form and was retested with the adaptive computerized form. The results showed a high degree of similarity, indicating that the three forms of the MMPI-2 were quite comparable. The subjects were also administered a variety of other scales, such as the Beck Depression Inventory; the results again supported the comparability of the three forms. The administration of the computer adaptive form achieved about a one-third time savings.

Item banking. Computerized testing in general, and adaptive testing specifically, often requires large item pools. This is an advantage for computers, because large item pools can be easily stored in a computer's memory. These item pools are referred to as item "banks" (for a view of item banking in Holland and in Italy, see LecLercq & Bruno, 1993). The computer allows subsets of items with specific characteristics (e.g., all items testing material from Chapter 14, or all items on scale X, or all items that are empirically correlated with GPA) to be called and selected – these can then be modified and printed or presented as a test. CBTIs also require extensive pools of interpretive statements called "libraries." For example, one MMPI computer-based test interpretation system contains more than 30,000 sentences (cited in Moreland, 1991).

A computer with a large item bank can be programmed to randomly select a subset of items as a test, so that each person to be tested gets a different test. This can be useful in academic settings, as well as in situations where retesting might be needed. This should, of course, be distinguished from adaptive testing.
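Random assembly from a bank is a one-liner in code; here is a sketch with hypothetical item names:

import random

def draw_test(bank, n_items=25, seed=None):
    """Draw a fresh random test form from the item bank."""
    return random.Random(seed).sample(bank, n_items)

# Each examinee (or each retest) can get a different form:
form_a = draw_test(["item_%03d" % i for i in range(200)], seed=1)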
Purposes of testing. We discussed the basic purposes of testing in Chapter 1; these purposes can interact with computer methodology. For example, if a test is to be used for diagnostic or placement purposes, then branching could be fully used so that the testing can become a teaching tool. (However, if the basic purpose is to assess mastery, then such branching may not be appropriate.)
Advantages and Disadvantages of Computer Use in Testing

Advantages. In addition to the advantages already mentioned above (such as increased scorer reliability), there are a number of advantages to using computers in the testing process:

1. Better use of professional time. Paper-and-pencil tests are often administered by clinicians who could better use their time and expertise in diagnostic and/or therapeutic activities, or in conducting the research needed to validate and improve tests. Tests on computers allow the use of trained "assistant psychometricians" or even clerks who would not require extensive educational background or professional training.
2. Reduced time lag. In most clinical and/or client settings there is a serious lag between test administration and availability of results – whether we refer to achievement test batteries given in the primary grades, tests given to psychiatric patients, tests used in applied settings such as preemployment assessment, or other tests. The use of a computer can make "instant" feedback a reality in most situations. Scores can be provided quickly not only to the examinee, but also to various agencies – for example, in the case of GRE scores that may need to be sent to a dozen or more universities.
3. Greater availability. Tests on computers can be administered when needed, with fewer restrictions than the traditional p-p format. For example, a student wishing to take the GRE can do so only on one of several dates in p-p format, but computer administration can be scheduled more frequently.
4. Greater flexibility. Individuals can be tested in a computer setting individually or in groups, usually in more user-friendly environments than the large classroom-auditoriums where tests such as the SAT and the GRE have been administered traditionally. The computer format is also much more flexible than the printed page; for example, split screens can show stimuli such as a picture, as well as the possible responses. In addition, the computer format allows each examinee to work at his or her own pace, much more so than the p-p version.
5. Greater accuracy. Computers can combine a variety of data according to specific rules; humans are less accurate and less consistent when they attempt to do this. Computers can handle extensive amounts of normative data, but humans are limited. Computers can use very complex ways of combining and scoring data, whereas most humans are quite limited in these capabilities. Computers can also be programmed to continuously update the norms, predictive regression equations, etc., as each new case is entered (see the sketch following this list).
6. Greater standardization. In Chapter 1, we saw the importance of standardizing both test procedures and test interpretations. The computer demands a high degree of such standardization and, ordinarily, does not tolerate deviance from it.
7. Greater control. This relates to the previous point, but the issue here is that the error variance attributable to the examiner is greatly reduced, if not totally eliminated.
8. Greater utility with special clients or groups. There are obvious benefits to computerized testing of special groups, such as the severely disabled, for whom p-p tests may be quite limited or inappropriate.
9. Long-term cost savings. Although the initial costs of purchasing computer equipment, developing program software, etc., can be quite high, once a test is automated it can be administered repeatedly at little extra cost.
10. Easier adaptive testing. This approach requires a computer and can result in a test that is substantially shorter and, therefore, more economical of time. The test can also be individualized for the specific examinee.
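As promised in point 5, here is a minimal sketch of continuously updated norms, using Welford's online algorithm so that the running mean and SD are revised as each new case is entered; the class and variable names are, of course, illustrative.

class RunningNorms:
    """Normative mean and SD updated incrementally (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0            # running sum of squared deviations

    def add_case(self, score):
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (score - self.mean)

    @property
    def sd(self):
        return (self._m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

norms = RunningNorms()
for score in (12, 15, 9, 14):    # each newly entered case updates the norms
    norms.add_case(score)
print(norms.mean, round(norms.sd, 2))   # 12.5 2.65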
Are advantages really advantages? In reading the above list, you may conclude that some advantages may not really be advantages. There may be some empirical support for this, but relatively little work has been done on these issues. For example, immediate feedback would typically be seen as desirable. S. L. Wise and L. A. Wise (1987) compared a p-p version and two cf versions of a classroom achievement test with third and fourth graders. One cf version provided immediate item feedback and one did not. All three versions were equivalent to each other in mean scores. However, high math-achievers who were administered the cf version with immediate feedback showed significantly higher state anxiety; the authors recommended that such
feedback not be used until its effects are better understood.

Disadvantages. In addition to the disadvantage just discussed, we might consider that computerized testing reduces the potential for observing the subject's behavior. As we have seen, one of the major advantages of tests such as the Stanford-Binet is that subjects are presented with a set of standardized stimuli, and the examiner can observe directly or indirectly the rich individual differences in human behavior that enhance the more objective interpretation of test results. With computerized testing, such behavior observation is severely limited.

ETHICAL ISSUES INVOLVING COMPUTER USE

The use of computers involves a number of issues that are potentially ethical and/or legal. R. D. Fowler (1985) identified four basic questions:

1. Do computers "dehumanize" the assessment process? While this seemed to be a major concern when computers first became available, it no longer seems to be an issue. For example, White (cited in Fowler, 1985) found that 80% of college students preferred taking the MMPI by computer. J. H. Johnson and Williams (1980) reported that 46% of their subjects said they were more truthful when responding to the computer than to a clinician.
2. What are the ethical implications? For example, when test scoring and test reporting were first offered by mail, some psychologists believed this might be a violation of the American Psychological Association (APA) policy against mail-order testing.
3. Who is qualified to use CBTI reports? Many test publishers have restricted the sales of tests to qualified purchasers, often broadly defined. Such policies have, in some cases, applied to the sale of computer-generated services.
4. How valid are CBTIs? Quite clearly, the validity of the CBTI is intrinsically related to the validity of the test upon which the CBTI is based. If the test has poor validity to begin with, the CBTI will also have poor validity. In the 1980s, a number of writers pointed out that there was no evidence that such CBTIs were valid (e.g., Lanyon, 1984; Matarazzo, 1983).

There are in fact a number of such ethical-legal issues (N. W. Walker & Myrick, 1985). One centers on unauthorized access to records and violations of confidentiality. There is the possibility that computers promote the indiscriminate storage of personal information and may result in violations of privacy. Equipment does fail, and there may be loss of information as a result of such failures. Many of the issues center on CBTIs. Some people may show blind acceptance of computer interpretations, although the limited data suggest this is not the case. A major unresolved issue is who is responsible for the CBTI. Is it the individual clinician who submitted the client's protocol? Is it the clinician(s) who originally developed the interpretive program? Or is it the company that sells the scoring service? These are not simply "academic" concerns; they are voiced in applied settings, such as schools (Jacob & Brantley, 1987). Some of the concerns are somewhat less tangible. For example, Matarazzo (1986) wrote that when the Binet-Simon test was first developed, it was developed within a philosophy that emphasized the worth of the individual; most assessment today is carried out with sensitive concern for the client, but such sensitivity is easily neglected with computerized technology.

The CBTI guidelines. It is generally agreed that computer technology in the area of psychological testing be applied with the same ethical and professional standards as traditional tests – but this may not always happen. At least two states (Colorado and Kansas) have published standards for computerized psychological assessment, and in 1986 the APA developed Guidelines for Computer-Based Tests and Interpretations (APA, 1986), known as the CBTI Guidelines.

The CBTI Guidelines are an extension of the Standards for Educational and Psychological Testing (1999) and are designed to guide test developers to establish and maintain the quality of their products, and to assist professionals in using CBTIs in the best interests of clients and the public (Schoenfeldt, 1989).

The CBTI Guidelines distinguish four participants in the testing process: (1) the test developer, (2) the test user, (3) the test taker, and (4) the test
administrator. Test users are defined as qualified professionals who have (1) knowledge of psychological measurement; (2) a background in the history of the tests being used; (3) experience in the use of the test and familiarity with the associated research; and (4) knowledge of the area of intended application. The CBTI Guidelines recognize that nonpsychologists, such as physicians and lawyers, may have legitimate need to use psychological tests and computerized test reports, but require that they also have sufficient knowledge to serve the client public.

There are nine guidelines for test users that cover two areas: administration and interpretation. For administration, the main concern is standardization of procedures – that the conditions of computerized testing be equivalent to those in which normative, reliability, and validity data were obtained. In terms of interpretation, the CBTI Guidelines state that "computer-generated interpretive reports should be used only in conjunction with professional judgment." Basically, this reflects the APA policy that computer-based interpretations are considered professional-to-professional consultations.

There are 31 guidelines under the heading of test developer, and these relate primarily to two issues: (1) computer-based administration issues (such as the client having the opportunity to change a test answer); and (2) psychometric issues, such as establishing the equivalence between p-p and cf versions of the same test (see Schoenfeldt, 1989). A full copy of the CBTI Guidelines can be found in B. F. Green (1991).

OTHER ISSUES AND COMPUTER USE

Legal issues. Psychologists are often called upon as expert witnesses in the courtroom. Unfortunately, studies that evaluate the accuracy of diagnosis and prediction show mixed results and cast substantial doubt as to whether psychological evaluation can meet the legal standards for expertise. Ziskin and Faust (1988) cite 1,400 studies and articles that cast doubt on the reliability and validity of clinical evaluations conducted for legal purposes. Why is the reliability and validity of such psychological evaluations so limited? Faust and Ziskin (1989) cite several reasons: (1) psychological theory is not so advanced as to permit precise behavioral prediction; (2) available information that is potentially useful is often misused, by disregarding base rates and by reliance on subjective rather than objective procedures; and (3) clinicians, like people in general, have a limited capacity to manage complex data. These reasons once again point to the future potential of using CBTIs.

Computer tests. For now, much effort has been devoted to "translating" p-p versions to cf versions, and relatively little effort has been devoted to creating new computer-administered tests. A number of tests have, however, been developed specifically for computerized use, and some of these take advantage of the graphic possibilities of the computer (e.g., Davey, Godwin, & Mittelholtz, 1997). As one might imagine, psychologists in the military, both in the U.S. and in other countries including England, have pioneered many such techniques (B. F. Green, 1988). For example, one technique used with Navy personnel is an "Air Defense Game" that involves a simulated radar screen with hostile air targets approaching one's ship. The examinee has to defend the ship by launching missiles. The effects of stress and other aspects on such "test" performance can be easily studied (Greitzer, Hershman, & Kelly, 1981).

These approaches would seem to be particularly useful in the assessment of perceptual-motor coordination and decision-making aspects, as well as other domains of human abilities (Fleishman, 1988).

Barrett, Alexander, Doverspike, et al. (1982) reported on a battery of information-processing tests developed specifically for computer testing (although p-p versions of these tests are also available). For example, in the Linear Scanning test, 20 equilateral triangles are presented in a row. Each triangle has a line through it, with the exception of one to four triangles. The row of triangles is presented for 1.5 seconds and erased, and the subject needs to indicate how many triangles did not have a line through them. The split-half reliability for most of these measures was above .80, but test-retest reliability, with a 2-week to 1-month interval, was rather low, with none of the 15 reported correlation coefficients higher than .60, and in most cases substantially lower. To be sure, these measures are more "complex" in structure than the type of items found on personality inventories or traditional cognitive tests, but the
results are disappointing, nevertheless. (See Van den Brink and Schoonman, 1988, for an example of computerized testing in the Dutch Railway personnel system; and N. Schmitt, Gilliland, Landis, et al., 1993, for an example of computerized testing for the selection of secretarial applicants.)

A LOOK AT OTHER TESTS AND COMPUTER USE

The CPI. Despite the fact that the CPI is a well-researched personality inventory, and that its CBTIs are widely available and used, there is very little research currently on computerized applications. Sapinkopf (1978) computerized the CPI and used an adaptive-testing approach that essentially presented 67% fewer items and thus took less time to administer. However, the reliability of this computerized version was lower than that of the p-p version.

Projective techniques. Attempts have been made to computer score such projective techniques as the Rorschach and the Holtzman Inkblot Test (e.g., Gorham, 1967; Gorham, Moseley, & Holtzman, 1968; Piotrowski, 1964), but in general these do not seem to be widely used by clinicians.

As you might expect, computer administration and computer interpretation of projectives present major challenges, but an interesting example comes from a sentence-completion test. Veldman (1967) administered 36 sentence stems to more than 2,300 college freshmen; these stems required a one-word response. He thus obtained a response pool of more than 83,000 responses. By eliminating all words with frequencies of less than 1%, a pool of 616 "common" responses was kept. Veldman then developed a computer program that administers each sentence stem and waits for the subject to respond. If the response is a rare word, the computer program requests a synonym. A second rare response results in another request; if the third response is also rare, then the second sentence stem is presented. If the response is a common word – that is, one in the computer's memory – then follow-up questions are presented. For example, if the sentence stem "My work has been ___" is responded to with the word "hard" or "difficult," the computer asks, "What do you find difficult about it?" If the response is "good," the computer asks, "To what do you attribute your success?" Although such an approach has little utility at present (for one, note that only 616 of the 83,000 responses can be handled by the computer), it points to future developments.
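A minimal sketch of this logic follows; the word pool and follow-up table are tiny hypothetical stand-ins for Veldman's 616 common responses and his library of follow-up questions.

# Hypothetical stand-ins for the common-response pool and the follow-up
# questions keyed to each common word.
COMMON = {"hard": "What do you find difficult about it?",
          "difficult": "What do you find difficult about it?",
          "good": "To what do you attribute your success?"}

def administer_stem(stem, get_response):
    """Present one stem; return a follow-up question, or None to move on."""
    for attempt in range(3):
        word = get_response(stem).strip().lower()
        if word in COMMON:               # common word: branch to its follow-up
            return COMMON[word]
        if attempt < 2:                  # rare word: ask for a synonym and retry
            print("Please type a synonym.")
    return None                          # three rare responses: next stem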
The Beck Depression Inventory. Steer, Rissmiller, Ranieri, et al. (1994) administered the BDI in a cf version to 330 inpatients diagnosed with mixed psychiatric disorders. The coefficient alpha was reported as .92, and the BDI significantly differentiated patients diagnosed with mood disorders from those with other psychiatric disorders. Scores on the BDI correlated significantly from pretest to posttest some 9 days later, and correlated significantly with scores on the Hopelessness scale. Scores were not significantly related to gender, ethnicity, or age. Thus the reliability and validity of the BDI cf version seem comparable with those of the p-p version.

However, in a different study with college students (Lankford, Bell, & Elias, 1994), the results showed that students high on computer anxiety scored higher on the BDI cf format than on the p-p format. The authors concluded that the use of computer-administered personality tests may not be a valid procedure in assessing personality dimensions because: (1) elevated computer anxiety is related to elevated scores on tests that measure negative affect and to lowered scores on tests that measure positive affect; and (2) scores of female subjects on computer-administered tests were altered more than scores of male subjects, and this was not due simply to higher computer anxiety. Therefore, these authors recommended that standardized normative distributions may not be applicable to computerized personality tests.

Behavioral assessment. Kratochwill, Doll, and Dickson (1985) point out that microcomputers have, both actually and potentially, revolutionized behavioral assessment. First, behavioral assessment can be time- and personnel-intensive, and microcomputers can reduce the cost associated with this. Second, microcomputer technology can be applied to the full range of behavioral-assessment techniques, such as psychophysiological recordings, direct observation, and self-monitoring. Third, behavioral assessment has in the past lacked standardization;
the use of computers can aid substantially in this respect. Fourth, because microcomputers are now readily available, the dissemination of behavioral-assessment techniques is facilitated. Finally, microcomputers have the potential to strengthen the relationship between assessment and treatment.

Miscellaneous Issues Involved in Computer Use

The disabled. A number of studies have investigated the use of light pens, joysticks, and other mechanical-electronic means of responding to test items for disabled individuals who are not able to respond to tests in traditional ways. The results suggest substantial equivalence across response modes (e.g., Carr, Wilson, Ghosh, Ancill, & Woods, 1982; Ridgway, MacCulloch, & Mills, 1982). (See S. L. Wilson, 1991, for a description of a microcomputer-based psychological assessment system for use with the severely physically disabled, as used in a British medical facility, and S. L. Wilson, Thompson, & Wylie, 1982, for examples of automated psychological testing for the severely physically disabled.)

Response time. Computers can easily assess response time, that is, how fast a subject responds. Response time (or reaction time, response latency) to questionnaire items could be a useful additional measure in a number of research areas (Ryman, Naitoh, Englund, et al., 1988). Some authors, for example, have argued that such latencies can potentially be indicative of meaningful variables. On a personality test, longer latencies may reflect more "emotional" items (Space, 1981). Response time brings up two issues: (1) Are the individual differences associated with such responding related to different behaviors (e.g., are faster responders more emotional?); and (2) Are items that are responded to more rapidly different from those that require a longer reaction time, either within the individual (e.g., if I take longer to respond to item X, is that item more conflict-laden for me?) or within groups (e.g., are items that take longer to respond to more complicated or confusing?)?

T. G. Dunn, Lushene, and O'Neil (1972) administered a computerized version of the MMPI to college students. Response times were averaged across students and entered as the dependent variable in a regression analysis to determine whether response time is related to item characteristics, such as item length and ambiguity. They found that item length accounted for about half of the variance; that is, for the subjects as a group, response time was clearly a function of item length.
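The analysis itself is a simple least-squares regression, as in this sketch with made-up data; the item lengths and mean latencies here are hypothetical.

# Hypothetical data: item length in words, mean response time in seconds.
lengths = [4, 7, 9, 12, 15, 21]
times = [1.9, 2.4, 2.6, 3.1, 3.4, 4.2]

n = len(lengths)
mx, my = sum(lengths) / n, sum(times) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(lengths, times))
sxx = sum((x - mx) ** 2 for x in lengths)
syy = sum((y - my) ** 2 for y in times)

slope = sxy / sxx                  # extra seconds per additional word
r2 = sxy ** 2 / (sxx * syy)        # proportion of variance accounted for
print("slope = %.3f s/word, R^2 = %.2f" % (slope, r2))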
Additional concerns. As with any technological revolution, there are problems and challenges to be met, some of which we have discussed above, some of which we have ignored because they would take us too far afield, and some of which we are not even as yet aware. J. A. Cummings (1986) relates an interesting series of studies done by him and his colleagues. They wanted to use CBTIs to generate early detection of emotional distress in medical patients of primary-care physicians, so that the physicians could refer these patients to psychotherapy. They studied more than 10,000 patients who were seen by 36 primary-care physicians. All of the patients took a 3-hour automated multiphasic health screening that included a computerized psychological questionnaire. For a random half of the patients, the physicians received a computer printout, with the suggestion (if appropriate) that the patient be referred for psychotherapy. The results indicated that the computerized report and suggestion elicited no more referrals than occurred in the group with no report.

The authors speculated that perhaps the report did not contain sufficient information to motivate the primary-care physician; they repeated the study, this time providing three types of reports, ranging from a relatively brief description of the computerized findings to an extensive, detailed description of the patient's emotional distress and personality. In this case, the report worked. Not only were there more referrals in the experimental than in the control (no report) group, but as the complexity of the report increased, the percentage of referrals increased. In a subsequent study, increased information in the report not only increased the likelihood of referral for psychotherapy, but also increased the number of missed medical diagnoses; that is, the symptoms of the patient were initially ascribed by the physician to emotional distress, but were subsequently rediagnosed as a physical illness.
Acceptability to clients. Originally, some psychologists feared that computerized testing depersonalized the client. As we have seen, the literature indicates rather strongly that clients react favorably to computerized testing, that such acceptability increases as the topic of the test becomes more "sensitive," and that any anxiety induced by the computer is usually brief, especially if adequate practice is provided (see M. J. Burke & Normand, 1987).

Response bias. It is reasonable to suppose that computer administration of tests might increase the honesty of responses, particularly to "sensitive" types of questions – i.e., response bias ought to be reduced. The view here is that acquiescence, social desirability, and so on operate primarily because the test situation is perceived as embarrassing and not confidential; thus, a computer would be seen as more anonymous and confidential. In one study, three groups of college students were administered questionnaire items in either a p-p format, an interview format, or via computer. The questionnaire included embarrassing questions. The computer group answered more questions in the keyed direction (i.e., admitted "true" to more embarrassing questions) and scored in the less defensive direction on the MMPI K scale, although the differences between groups were not statistically significant (D. Koson, Kitchen, M. Kochen, & Stodolosky, 1970).

Social desirability. Some studies on the equivalence of p-p and cf questionnaires have found that respondents on the cf version admit to more anxiety symptoms, score lower on lie scales, and endorse fewer socially desirable responses. Other studies, however, find that the two modes of administration yield similar results as far as social desirability is concerned (Booth-Kewley, Edwards, & Rosenfeld, 1992).

Familiarity with computers. Just as in Chapter 16 we were concerned about testwiseness, here we need to be concerned about computer wiseness or sophistication. If a child is brought up in a home environment where a computer is part of the furniture, or in a school that uses computers extensively, will that child be at an advantage in taking computerized tests? Will women be handicapped, because of a stereotype that males are more interested in and better at computer-related activities? Prior familiarity with computers can affect performance on at least some tests. For example, D. F. Johnson and White (1980) administered a computerized version of the Wonderlic Personnel Inventory (an intelligence test frequently used in employment screening) to 20 elderly volunteers. Ten of the participants received 1 hour of training on the computer terminal and 10 did not. Those who received the training scored about 5 points higher (approximately one SD). However, initial studies indicate no differences in performance between boys and girls, between African-Americans and Anglos, and between those with different levels of prior experience.

Achievement tests. Classroom exams have traditionally been administered to the entire class at one sitting, and make-up exams represent a major headache for the instructor, because either a new form has to be created or the student has to be prevented from talking with any other student in the class. With on-line testing, each student can take a test at a microcomputer terminal, where the test items may well be unique to that administration, randomly selected from an item bank.

Can the computer generate tests? In some areas, the answer is "Yes." For example, the sentence verification technique is used to construct valid tests of language comprehension. Traditionally, language comprehension has been assessed by presenting a prose passage and then asking the subject a number of questions, often in a multiple-choice format. The sentence verification technique also presents a passage, but the task requires the subject to recognize rather than recall textual information by responding to sentences that either correspond to the original meaning of a sentence in the passage or do not. These sentences could be either identical sentences or paraphrased sentences; in either case the keyed response would be "Yes." Or the sentences could be sentences that have been changed from the original passage (e.g., "My dog is white" changed to "My dog is not white") or sentences that are related but were not in the original passage; in either case, the keyed answer is "No."
For an example of computer software that automates and simplifies the process of constructing such a test, see Walczyk (1993).
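As a rough sketch of how such software might assemble verification items, consider the following Python fragment. This is our illustration, not Walczyk's program; the crude negation rule and the requirement that a paraphrase be supplied are simplifying assumptions:

def negate(sentence):
    # Crude negation for simple "X is Y" sentences (illustration only).
    return sentence.replace(" is ", " is not ", 1)

def make_svt_items(original, paraphrase, unrelated):
    # Returns (sentence, keyed response) pairs of the four item types.
    return [
        (original, "Yes"),         # identical sentence
        (paraphrase, "Yes"),       # preserves the original meaning
        (negate(original), "No"),  # meaning changed from the passage
        (unrelated, "No"),         # related, but not in the passage
    ]

items = make_svt_items(
    original="My dog is white.",
    paraphrase="The dog I own is white.",
    unrelated="My dog likes to chase cats.",
)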
THE FUTURE OF COMPUTERIZED PSYCHOLOGICAL TESTING

It is probably foolhardy to look into our crystal ball and predict what the future of psychological testing will be like with the computers of the future. We can, however, look at what is available now and use that as a limited and probably myopic view of the future.

Constructed responses. At present, computerized applications work best when the items are selected-response items (e.g., multiple-choice items). Constructed-response items, such as essay exams or sentences given as responses, present much more of a challenge.

Voice output. Voice output devices are now available so the computer can "speak" to the examinee and, for example, present administrative instructions. The potential of this for blind subjects or those with limited reading skills is only now beginning to be explored.

Voice input. Voice analysis indicators are potentially related to a variety of variables, such as stress and deception. By having subjects respond orally rather than by clicking a mouse, such indicators could be related to various test characteristics.

Interactive video tests. An example of future directions is presented by Dutch researchers from their National Institute for Educational Measurement (Bosman, Hoogenboom, & Walpot, 1994), who developed an interactive video test for pharmaceutical chemists' assistants. The test offers six cases that simulate real-life situations in a pharmacy. The assistant needs to (1) answer questions of the patient, (2) ask the patient relevant questions, (3) handle the administrative-computer program, (4) make the right choice of medicines written on the prescription, and (5) deliver the medicines with the right information.

The six cases were developed to cover a wide variety of situations, including dangerous interactions of different medicines, specific ways of handling medicines, and so on. The examinee works independently, and it takes about 1 hour to complete each case. The test is essentially composed of three different types of items: (1) multiple-choice questions; (2) open-questions, where the patient asks a question and the examinee types an answer, and the program checks the answer against a prerecorded list of possible responses; and (3) open-actions, where the examinee initiates the response by typing on the prescription or consulting with the physician (all on the computer screen).

The program is basically a branching program where at each decision point there are different alternatives. The path taken by each examinee can be quite different, but there is in fact an ideal, most efficient, correct pathway. One interesting aspect of this test is that feedback is provided both during the test and at the end of the test. In fact, each case can be replayed showing the correct pathway and providing comments on the student's particular responses that deviated from this pathway.

Scoring is somewhat complicated to explain, but very easy for the computer. The scoring system covers five categories: (1) accepting the prescription, (2) handling the computer program, (3) preparing the medicine, (4) managing the medical supplies, and (5) delivering the prescription. For each of these, there are two scores: effectiveness (i.e., was the problem solved, regardless of how the solution was reached) vs. efficiency (did the examinee follow the ideal path?). Unfortunately, for a sample of 143 pharmacy students, the alpha coefficient was only .58. In part, this is understandable because both the cases and the scoring procedures are heterogeneous. This is a pioneering effort with many challenges yet to be faced, but it is a good illustration of ingenuity and what the future might hold.
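The effectiveness-versus-efficiency distinction is easy to operationalize in code. The sketch below is ours, with invented node labels and an invented ideal pathway, and is only one of many ways such scoring could be programmed:

def score_path(path, ideal_path, solved_state):
    # Effectiveness: did the examinee's path end at the solved state,
    # regardless of the route taken?
    effectiveness = 1 if path[-1] == solved_state else 0
    # Efficiency: credit for taking the ideal steps, penalized for extra ones.
    overlap = len(set(path) & set(ideal_path)) / len(ideal_path)
    efficiency = overlap * min(1.0, len(ideal_path) / len(path))
    return effectiveness, efficiency

ideal = ["accept_rx", "check_interactions", "prepare_medicine", "deliver"]
examinee = ["accept_rx", "consult_physician", "check_interactions",
            "prepare_medicine", "deliver"]

# Solved the case (effectiveness = 1), but the detour lowers efficiency to .8.
scores = score_path(examinee, ideal, solved_state="deliver")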
maceutical chemists’ assistants. The test offers six
cases that simulate real-life situations in a phar-
SUMMARY
macy. The assistant needs to (1) answer questions
of the patient, (2) ask the patient relevant ques- In this chapter, we discussed how computers
tions, (3) handle the administrative-computer interface with psychological tests. Specifically, we
program, (4) make the right choice of medicines looked at computers as test-scoring machines,
written on the prescription, and (5) deliver the as test administrators, and as test interpreters.
medicines with the right information. One of the major issues is that of equivalence:
The six cases were developed to cover a Is the computer version of a paper-and-pencil
wide variety of situations, including dangerous test equivalent to the original test? Another major
interactions of different medicines, specific ways issue is the validity of computer-based test inter-
of handling medicines, and so on. The examinee pretations. The MMPI has played a central role
in the development of computer use in testing, but other tests are also involved. Neuropsychological testing presents special challenges, and adaptive testing some potentially novel solutions. Although the advantages of computer use far outweigh the disadvantages, there are ethical and other issues that we need to be concerned about.

SUGGESTED READINGS

Fowler, R. D. (1985). Landmarks in computer-assisted psychological assessment. Journal of Consulting and Clinical Psychology, 53, 748–759.
An interesting review of the history and development of computer-based test interpretation by a talented psychologist who was a pioneer in this area.

Klee, S. H., & Garfinkel, B. D. (1983). The computerized continuous performance task: A new measure of inattention. Journal of Abnormal Child Psychology, 11, 487–496.
An example of a study of a test administered by computer, with such aspects as sensitivity and specificity discussed.

Matarazzo, J. D. (1986). Computerized clinical psychological test interpretations. American Psychologist, 41, 14–24.
An excellent, thought-provoking article about the potential and the problems associated with CBTIs, written by one of the leading clinical psychologists in the United States. (See the February 1986 issue of the American Psychologist, pp. 191–193, for some rejoinders to this article.)

Schoenfeldt, L. F. (1989). Guidelines for computer-based psychological tests and interpretations. Computers in Human Behavior, 5, 13–21.
A review and discussion of many of the guidelines included in the APA "Guidelines."

Tanner, B. A. (1993). Computer-aided reporting of the results of neuropsychological evaluations of traumatic brain injury. Computers in Human Behavior, 9, 51–56.
Reports on a computer program, the TBI Report Assistant, designed to be used with patients known to have suffered traumatic brain injury. The program basically describes a patient's current performance on several tests, including the WAIS-R and the Wechsler Memory Scale-Revised, and compares that performance to the patient's peers.

DISCUSSION QUESTIONS

1. The book discusses three types of service associated with computer administration of tests. What is available on your college campus or community? Do these services deviate from the three categories presented?
2. How might you carry out a study to determine whether a personality variable (such as impulsivity, for example) interacts with some aspects of computerized presentation of a test?
3. To assess the validity of a CBTI, one could compare a clinician's judgment based on the CBTI vs. a second clinician's judgment based on personal knowledge of the client. What do you think of this procedure? What are some of the potential problems?
4. Could adaptive testing be used in the exams you take in this class?
5. What are some of the ethical issues involved in using computers in psychological testing?
18 Testing Behavior and Environments
AIM This chapter looks at a particular point of view called behavioral assessment and
contrasts this with the more traditional point of view. We look at a variety of instru-
ments developed or used in behavioral assessment to illustrate various issues. We then
turn our focus to four broad areas of assessment that transcend the individual: program
evaluation, the assessment of environments, the assessment of family functioning, and
finally, some broad-based, flexible techniques.

TRADITIONAL ASSESSMENT

Much of traditional personality assessment, and therefore testing, is based upon psychodynamic theory, as found in Freud's writings, for example, and trait theory, as in the work of Gordon Allport and of Raymond B. Cattell. Both of these approaches view personality as the central aspect to understand, predict, or alter behavior. Both of these approaches assume that there are a number of dimensions called traits (or drives, needs, motives, etc.) that exist within the individual, are relatively stable, and give consistency to behavior – that is, knowing that a person is high on aggression allows us to predict with some accuracy that the individual will behave in certain ways across a number of situations. In both of these approaches, we infer that certain dimensions exist and that behavior is a "sign" of such underlying dimensions. Thus, the responses of a subject to the Beck Depression Inventory are seen as evidence that the subject is (or is not) depressed. The test performance is an indicator, a sign, of the underlying hypothesized construct.

Behavioral assessment, on the other hand, does not use such inferences, but looks at the specific variables that control or affect behavior in a specific situation (e.g., Goldfried & Kent, 1972). Behavior has both antecedents and consequences. For example, the antecedent may be a stimulus, such as seeing a snake, and the consequences, what occurs after the response, may be being rewarded by attention as one tells how one met the great rattlesnake. In behavioral assessment, problem behaviors such as fear of snakes are not seen as signs of an underlying trait of phobic personality or maladjustment; the problem behavior is the problem itself. The test responses are not a "sign" but rather a sample of behavior to be interpreted directly. In behavioral testing, the focus is on assessing the behavior directly rather than a hypothesized trait. Originally, such assessment did not use questionnaires or tests, but focused on direct behavioral observation. Psychological tests were viewed with distrust and not used (Greenspoon & Gersten, 1967). Eventually, however, even behavioral assessment began to develop questionnaires, rating scales, and so on.

Another aspect of behavioral assessment is that trait labels are translated into operational definitions. Thus, a behavioral assessment questionnaire would not ask, "Are you depressed?" but would ask, "How many times in the past month did you have crying episodes?" Similarly, alcoholism might be translated into number of beverages consumed per day, and insomnia into number of hours spent sleeping.
Today, many psychologists would argue that the two approaches differ in their assumptions about behavior rather than necessarily in the specific techniques. In the traditional view, behavior is seen as the result of relatively stable causal variables assumed to exist within the person – personality. Thus, the focus of assessment is on what the person has. By contrast, behavioral assessment focuses on what the person does in a particular context. Behavior is seen as the result of both organismic variables, such as biological and genetic influences, and the current environment.

BEHAVIORAL ASSESSMENT

Beginning in the late 1940s, there was a marked shift in the United States from a psychodynamic approach to a behavioral approach in the treatment of maladaptive behaviors. Various techniques of behavior therapy were developed and promulgated. These techniques required the assessment of the specific client behaviors (such as phobias) that needed to be changed, as well as specification of what variables elicited and maintained the maladaptive behaviors. Traditional assessment procedures such as the MMPI or projective techniques were seen by some as not useful, and so alternative approaches were developed that are now called behavioral assessment.

Although behaviorism is not a new idea, the application of behavioral principles as a therapeutic procedure or as an intervention strategy really began in the late 1960s and early 1970s. At first, these efforts did not focus on assessment, but eventually assessment became a major priority. Early behavioral assessment focused on motor behavior, but today behaviorists view all activities as behavior. Behavioral assessment has broadened its scope and readily includes such aspects as physiological-emotional behavior and cognitive-verbal behavior (Nelson & Hayes, 1979). Some behaviorists even believe that projective techniques, which would seem to be the antithesis of behavioral assessment, can be useful within a behavioral-assessment framework (e.g., Prout & Ferber, 1988).

If the 1960s were the beginning of behavioral assessment, the 1970s can be considered the honeymoon period. This was followed by a period of disillusionment (S. C. Hayes, Nelson, & Jarrett, 1986), due in particular to three findings: (1) different measures, presumably of the same behavior, did not correlate significantly with each other, even when these consisted of direct observation; (2) there was a proliferation of nonstandardized behavioral-assessment techniques, many of which were not psychometrically sound; and (3) the available techniques did not result in differential diagnosis, that is, in ways of classifying clients into discrete groupings.

Motoric, Cognitive, and Physiological Behavior Assessment

Behavioral assessment focuses on behavior, and behavior has traditionally been categorized as either motoric, cognitive, or physiological. Motoric responses are probably the most commonly assessed, in part because they can be relatively easy to observe. Cognitive responses are, from a behavioral-assessment point of view, more difficult to define and measure. Typically these responses involve either thought or emotion, and only the outward result can be observed. The thoughts and feelings are "private" events, even though they can be verified empirically. If Brian says, "I'm angry," we watch his facial expressions, or we observe him kicking the chair.

Physiological responses such as heart rate, galvanic skin response, respiration, etc., can be somewhat easier to measure provided we have the right equipment, but are difficult to measure outside of medical or research settings.

It is interesting to note that typically there is only a moderate correlation between measures of the same variable in the three categories. For example, the verbal report of anxiety may be present even though the physiological signs are absent.

One way to categorize the techniques used in behavioral assessment is to label them as direct or indirect, based on the degree to which the actual target behavior is measured in a particular setting, that is, the degree to which the responses observed match the behavior of interest (Cone, 1977, 1978).

Direct Assessment Methods

1. Observation. Direct observation is preferred by behaviorists because observations are empirically verifiable and do not require any inference.
Behavioral observations can take place in a natural setting, such as a playground or a classroom (these are called naturalistic observations), or in an "analog" setting, such as a laboratory task that tries to simulate a real-life procedure. The observation may be "obtrusive," where the subject is aware of being observed, or "unobtrusive." Unobtrusive observations often focus on aspects of the environment rather than the person, for example, counting the number of empty beer bottles in someone's trash as an indicator of drinking behavior. From a practical point of view, such observations can be easy in some cases and nearly impossible in others. We can easily observe a child on the playground or in the classroom to assess his or her aggressiveness, but it would be more difficult, for example, to observe an adult executive as he or she interacts with subordinates and peers to assess anxiety. From a psychometric point of view, however, there are a number of challenges. Is what is to be observed so specific that two observers will agree as to the occurrence of the behavior (i.e., interrater reliability)? Does the act of observing alter the behavior (e.g., if Alberto knows he is being observed, will he throw that spitball)? What is to be recorded? Is it the occurrence of a behavior (Alberto throws a spitball), the frequency (Alberto threw six spitballs in the past hour), the antecedents (Alberto throws spitballs when the teacher ignores him), etc.? What coding system shall be used (e.g., is "throwing spitballs" an instance of "aggressive behavior"?)? (See Wahler, House, & Stambaugh, 1976, for an example of a coding scale that covers 24 categories of behavior for the direct observation of children in home or school settings.) In studies of behavioral assessment, direct observation is the most common assessment procedure used, followed by self-report (Bornstein, Bridgwater, Hickey, et al., 1980).
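A simple event-recording scheme makes these choices concrete. The following Python sketch is ours; the fields and the coding category are invented for illustration:

observations = []

def record(behavior, antecedent, minute, category):
    # Log one occurrence together with its antecedent and a coding category.
    observations.append({"behavior": behavior, "antecedent": antecedent,
                         "minute": minute, "category": category})

record("throws spitball", "teacher ignores him", 12, "aggressive behavior")
record("throws spitball", "peer laughs", 47, "aggressive behavior")

frequency = len(observations)  # occurrences during the observation period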
2. Self-monitoring. Here the subject observes his or her own behavior and records the results, for example, the amount of food eaten. Subjects may be asked to observe not only the behavior but the contingencies surrounding the behavior: Where does the eating take place? Who else was there? What were the thoughts and feelings associated with the eating? Although this appears to be a simple procedure, from a psychometric point of view there are also a number of challenges (e.g., J. L. Jackson, 1999). The accuracy with which an individual records his or her own behavior is an issue, with some studies reporting a high degree of accuracy and other studies a low degree of accuracy (e.g., Nelson, 1977). Self-monitoring also introduces the problem of reactivity; the behavior may change because it is being monitored. For example, cigarette smoking may decrease simply by having the subject record how many cigarettes he or she smokes and under what conditions.

3. Role playing. Sometimes direct observation is disruptive, difficult to implement, costly, or simply not practical. Although we can observe children in the classroom, it is more difficult to observe adults in the workplace. Role playing can then be used by setting up artificial situations; for example, in a therapy context, the therapist may play the role of the boss. Such role playing has also been translated into tests, where the respondent is given vignettes or instructions to pretend that he or she is in a specific situation. Responses, either open-ended or a choice among options, can then be scored (see Reardon, Hersen, Bellack, et al., 1979, for an example of assessing social skills in children). The basic assumption of these tests is that the way the client responds in this simulation is the way the client will respond in the real-life situation. Whether that is in fact the case is debatable (e.g., Bellack, Hersen, & Lamparski, 1979).

In addition to these three approaches, behavioral assessment uses a wide variety of techniques, such as laboratory tasks and psychophysiological measures, that are beyond the scope of this chapter.

Indirect Assessment

Here the behavior of interest is not observed directly; rather, the subject, or someone who knows the subject well, is asked about the behavior. Thus one must make behavioral inferences about the data collected and must empirically verify the data.

1. Interviews. This is perhaps the most frequently used technique in behavioral assessment. Interviewing is in and of itself a major topic with a voluminous body of literature. If one considers
interviewing as essentially an oral test, then it is legitimate to ask questions about reliability and validity. Interestingly, there is relatively little information about the reliability and validity of behavioral interviewing. One major exception is the work of Bergan and his colleagues, who have developed a behavioral-consultation model that sees interviewing as a series of verbal exchanges. These exchanges can be described in specific, operational terms, and the type and quantity of specific kinds of verbal statements can be coded and monitored, thus allowing for assessments of reliability and validity (see Bergan, 1977; Bergan & Kratochwill, 1990).

S. N. Haynes and Wilson (1979) indicate that the interview is the most frequently used, the least systematically applied, and the least frequently evaluated behavioral-assessment instrument. Recall that in Chapter 1 we suggested a test could be usefully considered as an interview. We can now turn that around and consider the interview as a test.

In behavioral settings, interviews are often used to screen clients for possible participation in specific therapeutic interventions and as diagnostic instruments. Interviews can also be used to identify subjects for research and/or clinical studies, or to obtain information from clients as to the cognitive components of their behavior disorders, because altering such cognitive components (e.g., negative self-thoughts) can, in effect, ameliorate the disordered behavior. Finally, interviews can be used to evaluate intervention outcomes.

There is a vast body of literature on interviews, both from the behavioral-assessment aspect and from a more general perspective; to review such literature would take us far afield. Studies that have been conducted on the validity of the interview are not particularly encouraging. For example, in one study (Sarbin, 1943), academic success was predicted for a sample of college freshmen on the basis of tests and interviews. Validity coefficients for the tests alone were .57 for men and .73 for women. When interview data were added, the coefficients were .56 and .73; in other words, the interview added nothing. However, we can arrive at some general conclusions:

1. In spite of the ubiquity of the interview, there is relatively little literature on its reliability and validity.

2. Interviews are conducted for different purposes, in different ways, by different examiners. Therefore, issues of reliability and validity cannot address interviewing in general, but must pay attention to the various aspects and circumstances. This is a clear situation where generalizability theory (as discussed in Chapter 3) could be useful.

3. Reliability would seem important to establish, but very few studies do so. In the area of behavioral assessment, a number of studies have looked at interrater agreement, for example, by taping interviews and having independent observers score them along specified dimensions. Such interrater agreement is often relatively high.

4. Internal-consistency reliability, for example, by repeating items within the same interview, is rarely evaluated.

5. A number of studies have looked at criterion-related validity, where interview results are compared with data from other assessment instruments. Results differ widely, with some studies reporting high validity coefficients and others low; what is needed is to determine under what conditions valid results are obtained.

6. Content validity seems particularly appropriate and adequate in structured interviews, which essentially represent a set of questions (much like the MMPI) read aloud by the examiner.

7. The validity of interviews can also be evaluated in a pre-post design, where an interview precedes and follows an experimental or therapeutic intervention. To the extent that the interview is valid, the data obtained in the interview should covary with the results of the intervention.

8. There are many potential sources of error in interviews; not only are these self-reports, but interactions between such aspects as the gender and ethnicity of the interviewer vs. the interviewee can introduce considerable "error" variance.

9. In general, from a psychometric point of view, interviews are less preferable than more standardized assessment procedures.

Structured vs. unstructured interviews. Interviews can vary from very unstructured to highly structured. In an unstructured interview, there is a goal (e.g., Is this a good candidate for this position?), but the format of the interview, the sequence of questions, and whether a topic is
covered or not, are all unstructured. Unstructured interviews are often used in therapeutic settings, for example, when a clinician conducts an initial interview with a client. The unstructured interview allows a great deal of flexibility and the potential to go in new or unplanned directions.

In a highly structured interview, the goal may be the same, but the procedure, questions, etc., are all predetermined and standardized. Thus, the structured interview lacks flexibility but provides standardization. From a psychometric point of view this is a preferable strategy because it permits quantification and comparability across interviewers, situations, and clients.

Types of interviews. There are, of course, numerous types of interviews and different ways of categorizing them. There are employment interviews, whose goal is generally to decide on a person's qualifications for a particular position. There are psychiatric intake interviews, where initial information about a client is obtained. There is the mental status exam, which is an interview that ascertains what abnormalities are present in the client (see Chapter 7). There are exit interviews, polls, case histories, and various other types of interviews.

Regardless of the type of interview, we can conceptualize the interview as made up of three aspects: the interviewer, the interviewee, and the interview process itself. Each of these three aspects also represents a source of error in terms of reliability and validity.

2. Checklists and rating scales. These are used widely in behavioral assessment and are completed either by the individual as a self-report or by someone else as an observer's report. Both the Child Behavior Checklist (Achenbach, 1991) and the Conners rating scales (Conners, 1990), which we covered in Chapter 9, are good examples of behavioral checklists and rating scales. Both focus on specific behaviors rather than hypothesized personality traits. In general, self-report scales that emanate from a behavioral perspective differ in two major ways from their more traditional counterparts. First, the items typically focus on behavior and use behavioral terms. Second, many (but not all) have been developed informally, so that their psychometric structure does not have the degree of sophistication and complexity found in instruments such as the MMPI or the SVIB. When behavioral-assessment instruments first became popular, there was a tendency to reject classical psychometric principles of reliability and validity and to be more concerned with other issues more directly relevant to behavioral observation. Subsequently, psychometric concepts were seen as quite applicable; while traditional testing was "rejected," traditional psychometric concepts were not.

Checklists and rating scales are economical, can be easily administered, and serve to focus subsequent interviewing and observational efforts. They are useful to quantify observations. In addition, these are typically normative instruments; they allow for the comparison of a specific child with norms. They also can be used to quantify any change that is the result of intervention efforts. At the same time, because they are indirect measures of behavior, behaviorists tend to be suspicious of them and are concerned that the obtained data may be affected by social desirability, situational aspects, and a host of other limitations.

We used the terminology of bandwidth and fidelity earlier, and there is a parallel between these terms and the direct-indirect dichotomy. Indirect methods are essentially broad-band, low-fidelity, whereas direct methods are narrow-band, high-fidelity. Broad-band, low-fidelity methods provide more information, but at a lower quality.

Note also that these indirect methods could be considered "self-reports." From a behavioral-assessment point of view, self-report has several advantages. The observer, who is also the subject, is always present when a specific behavior occurs, and the observer can observe "internal" events (thoughts and feelings).

Concerns. From a psychometric point of view, substantial concern has been voiced that the procedures used in behavioral assessment have not been subjected to the scrutiny that typically accompanies the creation of a new scale. More important is the concern that the methodology used in behavioral assessment is relatively unsophisticated and does not take advantage of psychometric principles that could make it more useful.
Behavior modification. Much of what is done under the rubric of behavioral assessment is for the purpose of behavior modification: the application of learning principles such as reinforcement and extinction to change behavior, specifically to eliminate unwanted behavior, such as fear of snakes, and to strengthen positive behavior, such as doing well academically. Most of the questionnaires developed focus on behavioral dysfunctions rather than, for example, the enhancement of creative achievement, happiness, superior intellectual functioning, and so on.

TRADITIONAL VS. BEHAVIORAL ASSESSMENT

Both approaches attempt to predict human behavior, but there are a number of differences between the two. One difference is that in behavioral assessment the information obtained should be directly relevant to the specific goals of the assessment. If a client has a snake phobia, behavioral assessment attempts to identify the day-to-day conditions related to the phobic behavior. Traditional testing would attempt a broader perspective and determine what type of person this individual is, and so on. The concepts of fidelity and bandwidth are again relevant here.

A second difference is that traditional assessment depends on the notion of intervening variables. That is, between stimulus and response, the assumption is that there are personality traits, motivational needs, and so on. Thus, the aim of traditional testing is to map these inferred constructs to predict overt behavior. In behavioral assessment, the focus is on the behavior itself, and the aim of testing is to obtain a sample of that behavior without making inferences about underlying constructs.

A third difference is that behavioral assessment is based on learning principles that include both specificity and generality. A behavioral anxiety scale would try to determine under what conditions and in what situations the client experiences the symptoms of anxiety. Traditional testing assumes that behavior is fairly consistent across situations – a person who is anxious when giving a talk to a class will also be anxious on a blind date – thus the items of a traditional anxiety scale are generic (e.g., I am anxious most of the time).

A fourth difference is that behavioral assessment typically uses multiple sources of data collection. Not only are interviews conducted, but behavioral observations are made, and checklists and questionnaires are used.

Another difference is that in behavioral assessment there is typically a rather strong insistence on clarity of definition and precision of measurement. These are, of course, highly desirable, but in traditional testing they may be absent or not as strongly emphasized.

Finally, traditional assessment typically either precedes treatment or uses a pre-post design. For example, a patient is assessed as depressed, is given 6 months of psychotherapy and medications, and is again evaluated at the end of the 6 months. With behavioral assessment, the assessment is typically ongoing and multiple in nature. For a more detailed analysis of the differences between traditional and behavioral approaches, see D. P. Hartmann, Roper, and Bradford (1979).

VALIDITY OF BEHAVIORAL ASSESSMENT

In Chapter 3, we discussed validity – specifically, content validity, criterion validity (both predictive and concurrent), and construct validity. Initially, a number of behaviorists argued that these traditional concepts did not apply to behavioral assessment, but the more acceptable view now is that they do (e.g., Cone, 1977).

Content validity is very important in behavioral assessment because it basically reflects adequate sampling (the observations made must be representative of the behavior of interest). Another way to look at this issue is to determine to what degree an observed behavior is specific to a particular situation (the same issue discussed under generalizability).

Predictive validity refers to the accuracy of a test in predicting specific types of behaviors. This also is important in behavioral assessment, particularly in the use of checklists and rating scales. In a number of studies, the predictive validity of such scales, as defined by correlations with real-life behaviors, has been less than adequate.

In behavioral assessment, concurrent validity is often defined as how well a particular scale or observational method correlates with other scales or methods designed to assess the same behavior. Concurrent validity data are frequently available
and seems to be a common technique in behavioral assessment.

Finally, there is construct validity, which looks at the underlying dimension that a test is supposedly measuring. This type of validity, as discussed earlier, is theory-driven and requires inferential thinking and hypothetical variables. This seems to be not entirely applicable to behavioral assessment, because the focus is on direct observation and not on underlying constructs. On the other hand, in dealing with concepts such as "social skills," the notion of construct validity seems quite germane.

Social validity. In the area of behavioral assessment, there is an additional type of validity called "social validity," which refers to the effectiveness of a behavior change as assessed by its utility. For example, we could easily teach a child to increase eye contact with others. However, such a change would not necessarily increase that child's social skills or alter the nature of his or her interaction with peers, so the behavior change would have little social validity (Kazdin, 1977). Social validity can be assessed in a number of ways, for example, by assessing "consumer satisfaction" or by comparing the experimental group to a control group that did not receive a treatment (Shapiro, 1987). Social validity really refers to applied intervention programs rather than to the tests that might be used to assess the efficacy of such interventions. Kazdin (1977) defined social validation as consisting of two procedures. First, the behavior of the target subject is compared with the behavior of peers who have not been identified as problematic. Second, subjective evaluations of the person's behavior are obtained, either from the subject or, better still, from "objective" observers.

Generalizability. As discussed in Chapter 3, an alternate way to consider reliability and validity is through generalizability theory. That is, the relationships of a given measure with either itself (reliability) or other criteria (validity) reflect the fact that a given score can be generalized in different ways. Cone (1977) suggests that generalizability theory can provide a way to evaluate behavioral-assessment techniques and indicates that there are six aspects to such generalizability: (1) scorer, (2) item, (3) time, (4) setting, (5) method, and (6) dimension.

1. Scorer generalizability refers to the degree to which data obtained from one observer or scorer match those obtained from a second observer. Interrater reliability would be an example, as would the degree to which a father's ratings of a child compare with the mother's ratings. Note also that two observers could agree with each other, yet both could be wrong.

2. Item generalizability can be translated into internal consistency – i.e., do all of the items measure the same phenomenon or behavior?

3. The issue of time is one of stability – i.e., test-retest correlation. The concern here is to what extent data obtained at one point in time are comparable with those obtained at other points in time.

4. Setting generalizability refers to the degree to which data obtained in one situation are representative of those obtained in other situations. From a traditional point of view, personality tests attempt to assess the typical way in which a person behaves, although it has been argued that behavior is not consistent across situations.

5. Method generalizability refers to the degree to which data obtained by different methods are in fact comparable. Do responses on a self-report measure correspond to responses on a behavioral-avoidance test?

6. Dimension generalizability refers to the comparability of data on two or more different behaviors – i.e., essentially construct validity. An example here might be whether scores on a measure of assertiveness are inversely related to scores on a measure of anxiety.

Self-reports. The reliability and validity of information gathered from an individual, i.e., self-report, has always been a controversial topic within psychology. For psychoanalysts, self-reports based on such techniques as free association ("what comes to mind when . . .") and retrospection ("tell me about your childhood") were presumed to be reliable, that is, replicable, but not particularly valid, because what the patient said was distorted by ego defenses, needs, early childhood experiences, and so on. For the behaviorists, as exemplified by John B. Watson, self-report was
basically rejected – the focus was on behavior, on what the client did rather than what the client said. Those who believed in the trait approach, on the other hand, saw self-report inventories as the way to go, with the assumption that subjects could provide reliable and valid responses on such tests. Even when the possibility of distortion was accepted, the solution was to create self-report scales, such as the MMPI, that could detect such faking.

Overall score. When one has a questionnaire, it is tempting to develop an overall score, a total. As Bellack and Hersen (1977) point out, however, the use of an overall score to summarize typical behavior or to predict some outcome is a trait concept.

Reliability and validity of observations. For observations to be valuable, they must be reliable and valid. In general, reliability is increased when the observer is trained, when the behavior to be observed is well defined, when the observations are specific (Tommy hits the other child) rather than general (Tommy shows aggressive behavior), and when the observations are recorded using an unambiguous system.

Evaluating behavioral assessment techniques. Behavioral-assessment techniques need to be evaluated as with any instrument, primarily in terms of reliability and validity. In addition, there are a number of other issues relevant to traditional tests that seem somewhat more visible in the area of behavioral assessment:

1. Sensitivity. To what degree does the test reflect changes in the behavior? If, for example, through behavior therapy we change a person's behavior so that they are now comfortable in visiting a snake exhibit at the local zoo, does our scale reflect this change?

2. Reactivity. This refers to changes in the behavior as a function of the measurement. For example, if we ask someone to self-record the amount of food they eat on a daily basis, will the very monitoring result in less food consumed? If we watch little Johnny in the classroom, will he engage in more or less disruptive behavior than is typical? S. N. Haynes and Wilson (1979) reviewed a number of studies showing that observational procedures can affect the behavior of the observed subjects. In some studies, subjects showed a greater degree of the target behavior when observed than when not observed, but in other studies, just the opposite effect was obtained. Thus, reactivity may be influenced by a number of variables, including how obtrusive the observational procedure is, the expectancies of the subjects, the types of behaviors being observed, and so on.

3. Expectancy. If, for example, we are conducting behavior therapy with a child who is hyperactive and ask the teacher to fill out a post-treatment questionnaire, the expectancies that the teacher has about the effectiveness of the therapeutic procedure may well influence the ratings.

Interobserver agreement. The accuracy of observations is typically assessed by interobserver agreement, raising the issue of interrater reliability. In addition to the Pearson correlation coefficient, there are a number of other indices used, such as percentage agreement, discussed in Chapter 15. There are a number of statistical issues that complicate the picture (see the 1977 Journal of Applied Behavior Analysis). One of the issues is that of chance agreement. If two observers are independently observing the same behavior, they may agree on the basis of chance alone. The kappa statistic (J. Cohen, 1960) takes into account such chance agreement.
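The logic of kappa can be shown with a small worked example; the numbers below are invented for illustration. Suppose two observers each classify the same 100 intervals as "behavior present" or "behavior absent," and they agree on 84 of them, so the observed agreement is p_o = .84. If one observer marked "present" on 60% of the intervals and the other on 50%, then by chance alone we would expect them to agree on p_e = (.60)(.50) + (.40)(.50) = .50 of the intervals. Kappa expresses agreement beyond chance:

κ = (p_o − p_e) / (1 − p_e) = (.84 − .50) / (1 − .50) = .68

Thus, agreement that looks impressive as a raw percentage (84%) is more modest once chance agreement is removed.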
BEHAVIORAL CHECKLISTS

Although behavioral checklists and rating scales are somewhat different, the two terms are sometimes used as synonyms. A behavioral checklist consists of a list of fairly specific behaviors, and the person completing the checklist indicates the presence or absence of each behavior; thus checklists require a series of binary decisions. A rating scale, on the other hand, typically involves making a judgment where the available responses number at least three (e.g., is this person low, average, or high on honesty?), and where the judgment made typically reflects a global score that summarizes a number of observations. For example, as a teacher, if I am asked to rate Johnny on how attentive he is in class, using a 5-point scale from very inattentive to very attentive, my judgment
will be a global judgment, hopefully summarizing a large number of observations.
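The structural difference between the two formats can be sketched in a few lines of Python (our illustration; the items and anchors are invented):

# A checklist item calls for a binary presence-absence judgment:
spitball_thrown = True  # the behavior occurred / did not occur

# A rating-scale item offers three or more ordered options, each anchored:
anchors = {1: "very inattentive", 2: "inattentive", 3: "average",
           4: "attentive", 5: "very attentive"}
johnny_attentiveness = 4  # one global judgment summarizing many observations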
There are a large number of behavioral checklists and behavioral rating scales used both for research purposes and in applied settings. In 1973, a group of researchers (Walls, Werner, Bacon, et al., 1977) placed an ad in various professional publications to locate behavior checklists that could be used with mentally retarded persons, psychiatric patients, children, or other populations. They were able to obtain and tabulate more than 200 such instruments, differing quite widely in their psychometric structure, reliability, and validity. These instruments differed from each other along a variety of dimensions. McMahon (1984) identifies three such dimensions:

1. Informants – i.e., who fills the scale out? It could be the client, peers, teachers, parents, co-workers, ward personnel, the therapist, and so on. Some scales have been developed for use with specific informants, some scales may have parallel forms for use by different informants, and some scales can be used by any informant.

2. Scope – i.e., does the instrument cover a variety of behaviors or only one type? Is the age range covered broad (e.g., children) or more restricted (e.g., preschool children)?

3. Structure – this refers to a variety of aspects. For example, some scales are composed of only one item whereas others are quite lengthy. Some items can be very specific and others quite global. Some checklists require yes-no responses; others provide more options, such as a 7-point response scale. Anchors can be very specific or more global (e.g., "at least three or more headaches every day" vs. "I suffer from headaches frequently"). The time period that the informant uses as a frame of reference (typically included as part of the instructions) can also vary – e.g., "How many headaches have you had in the past 6 hours?" vs. "How many headaches have you had in the past year?"

Reliability of checklists. Test-retest, internal consistency, and interrater reliability are probably the most pertinent types of reliability for behavioral checklists and for rating scales. If the behavior to be rated is very specific (e.g., Johnny pulls his little sister's pony tail), then test-retest reliability may be low and, in essence, inappropriate. Where appropriate, test-retest reliability is generally higher the shorter the time period between test and retest. We also need to distinguish, at least theoretically, between test-retest reliability and a pre-post research design where the instrument is given twice to assess the effectiveness of an intervention. Good research practice would require the use of a control group as well, which typically would yield a better measure of reliability.

Internal consistency is relevant only to the degree that the particular scale measures a unitary dimension or factor.

Interrater reliability can also, in an applied situation, be confounded with other aspects. For example, the ratings of teachers may not fully coincide with those given by mothers. However, such differences may not necessarily represent low interrater reliability or error, but may be legitimate sources of information.

Validity of checklists. Content validity as well as criterion-related validity (both concurrent and predictive) would seem to be of the essence. What affects the validity of behavioral checklists? S. N. Haynes and Wilson (1979) identified eight types of variables that could affect the validity:

1. Response bias. The response to one item may be affected by the responses to other items. Response bias means that different item arrangements may generate different responses, thus affecting validity. We saw an example of this with the Beck Depression Inventory.

2. Social desirability. In behavioral checklists, this might likely take the form of overreporting positive behaviors and underreporting negative behaviors.

3. Demand factors. Because behavioral checklists are often used in therapeutic situations to decide on a therapeutic intervention, for example, the situation may result in "different" scores than if there were no such demand factors – e.g., anonymous conditions.

4. Expectancies. Similarly, the client or respondent may have certain expectancies. Teachers, for example, who fill out the checklist on children to be referred to a school psychologist may have certain expectancies about what services will be
provided and what ratings may be required to initiate such services.

5. Population characteristics. A checklist may have been validated on one group, such as college students, but may not necessarily be valid with another group, such as counseling-center clients.

6. Observer reactivity. Although there is a substantial body of literature on the reactive effects of self-monitoring, substantially less is known about reactivity associated with self-report. For example, does reporting that one often gets into fights with one's spouse make the subject pause and try to alter that behavior?

7. Situational and behavioral specificity. This refers to the items in a particular checklist. The less specific and concrete an item, the greater the probability that the item can be interpreted (and misinterpreted) in a variety of ways.

8. Scale length. As a general rule, the longer the test, the greater the reliability and, to a lesser degree, the greater the validity. However, there may well be situations where a shorter checklist of carefully chosen items, or a checklist requiring more limited but better defined responses (e.g., a three-way judgment of below average, average, or above average rather than a 15-point response scale), may result in greater validity.

BEHAVIORAL QUESTIONNAIRES

A large number of questionnaires are now available to assess a variety of behaviors such as assertiveness, phobias, alcohol and drug abuse, marital interactions, and so on. There are literally hundreds of such questionnaires available. Table 18-1 lists some examples of such questionnaires. Perhaps the largest category of questionnaires falls under the rubric of problem behaviors, which includes a wide variety of behaviors such as school-conduct problems, aggressiveness, suicidal intent, obsessive-compulsive behaviors, hyperactivity, lack of assertiveness, and so on. For an excellent review of behavioral questionnaires in a wide variety of areas, see Chapter 6 in S. N. Haynes and Wilson (1979).

Table 18-1. Examples of Behavioral Questionnaires

Reference                              Questionnaire                       Assesses
Lang & Lazovik (1963)                  Fear Survey Schedule                Common fears
Thorne (1966)                          Sex Inventory                       Deviant sexual behaviors
Lanyon (1967)                          Social Competence                   Heterosexual interactions
Wollersheim (1970)                     Eating Patterns Questionnaire       Eating patterns
J. P. Galassi & M. D. Galassi (1974)   College Self-expression Scale       Assertiveness
Denney & Sullivan (1976)               Spider Anxiety                      Fear of spiders
Behar (1977)                           Preschool Behavior Questionnaire    Preschool behavior
R. L. Weiss & Margolin (1977)          Marital Conflict Form               Marital problem areas
Gresham & Elliott (1990)               Social Skills Rating System         Social skills
T. M. Fleming et al. (1991)            Roommate Relationships              Roommate relationships
Gillespie & Eisler (1992)              Feminine Gender Role Stress         Gender role stress
Greenbaum, Dedrick & Lipien (2004)     Child Behavior Checklist            Children's behaviors

Measurement of Anger

One of the very common problems of children seen for psychological or psychiatric services is the management of anger. Anger is a subjective, internal experience that may or may not be reflected in aggressive behavior. Thus, self-report would seem to be the most direct method of measurement (Finch & Rogers, 1984).

One such scale is the Children's Inventory of Anger (CIA; Finch & Rogers, 1984), a 71-item self-report inventory where each item is a mini-vignette (e.g., "the bus driver takes your name for acting up on the bus, but so was everyone else"). The child responds by selecting one of four "happy" faces that portray four reactions from happy to angry. Each face also has a verbal descriptive anchor.

A study of emotionally disturbed children over a 3-month period indicated a test-retest
reliability of .82. Spearman-Brown reliability for a much larger sample of normal and emotionally disturbed children yielded a coefficient of .91 for split-half and a coefficient of .95 for odd-even; Kuder-Richardson reliability was .96. Thus reliability, especially of the internal-consistency type, is exceptionally high.
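As a reminder, split-half coefficients such as these are obtained by correlating two halves of the test and then correcting with the Spearman-Brown formula, r_full = 2r / (1 + r), where r is the half-test correlation. For example, the reported split-half coefficient of .91 would follow from a correlation of about .84 between the two halves: (2)(.84) / (1 + .84) = .91.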
Concurrent validity was assessed by comparing CIA scores to peer ratings, with significant findings, and by comparing CIA scores with results on another checklist of acting-out behavior; these latter results were modest at best. A factor analysis indicated six factors, such as injustice and sibling conflict. Use of the scale with individual children in treatment suggested that the scale reflected the changes in anger management brought about by the therapeutic treatment.

The Daily Child Behavior Checklist (DCBC)

To assess problems of child behavior, mental-health specialists often depend on parental reports, usually in the form of behavioral checklists. As we have said, checklists have several advantages; they are economical, easy to administer, and easy to score. Checklists can be used to help define specific problems and can be used to jog the informant's memory.

Furey and Forehand (1983) criticized available checklists as containing items that were not equal (for example, "shy" vs. "wets the bed at night"), that focused on negative behaviors, and that aimed at global perceptions over a substantial time period. They set out to develop a checklist that (1) contained both positive and negative behaviors; (2) might be used daily; (3) used specific and objective items; and (4) focused on factual events rather than attitudes, emotions, and so on.

The resulting checklist contains 65 items, 37 positive (e.g., "gave parent a hug") and 28 negative (e.g., "refused to go to bed when asked"). The items were gathered from a variety of sources, but little information is given as to how specific items survived selection. Three scores are obtained from the DCBC: (1) total number of pleasing (positive) behaviors checked; (2) total number of displeasing behaviors checked; and (3) total score (pleasing minus displeasing behaviors).
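Scoring, in other words, is a simple tally. A minimal Python sketch follows; the first item in each list comes from the examples above, while the others are invented:

# Each DCBC-style item is checked as present (True) or absent (False).
pleasing = {"gave parent a hug": True, "helped clear the table": True}
displeasing = {"refused to go to bed when asked": True, "whined at dinner": False}

pleasing_score = sum(pleasing.values())           # here: 2
displeasing_score = sum(displeasing.values())     # here: 1
total_score = pleasing_score - displeasing_score  # pleasing minus displeasing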
jog the informant’s memory. Laboratory procedures. In this textbook, we
Furey and Forehand (1983) criticized avail- have emphasized paper-and-pencil tests because
able checklists as containing items that were not these are the largest category of tests, they are the
equal (for example “shy” vs. “wets the bed at ones that you most likely would meet either as
night”), that focused on negative behaviors, and a professional or as a lay person, and they are
that aimed at global perceptions over a substan- the ones that most clearly illustrate basic issues,
tial time period. They set out to develop a check- such as reliability and validity. However, there are
list that (1) contained both positive and nega- many tests and procedures used in psychological
tive behaviors; (2) might be used daily; (3) used research and assessment with formats different
specific and objective items that, and (4) focused from paper-and-pencil. For example, in the area
on factual events rather than attitudes, emotions, of sleep, researchers use the electroencephalo-
and so on. gram to determine the stages of sleep; they may
The resulting checklist contains 65 items, 37 also be interested in measuring body tempera-
positive (e.g., “gave parent a hug”) and 28 nega- ture, heart rate, respiration, penile erection, pos-
tive (e.g., “refused to go to bed when asked”). The tural changes, and so on. These instruments also
items were gathered from a variety of sources, but need to be evaluated from the perspectives of
little information is given as to how specific items reliability and validity (see W. W. Tryon, 1991,
survived selection. Three scores are obtained for an excellent review of activity measurement).
from the DCBC: (1) total number of pleasing As an example of these procedures, let us take
(positive) behaviors checked; (2) total number a brief look at penile tumescence measurement,
of displeasing behaviors checked; and (3) total typically used as a measure of sexual arousal in
score (pleasing minus displeasing behaviors). males.
Penile Tumescence Measurement (PTM)

Penile measurement techniques provide information about a male's sexual interests by measuring penile response to various categories of sexual stimuli. One type of penile gauge is a device that involves the encasement of the penis in a glass cylinder. Sexual arousal of the penis causes air displacement in the glass cylinder, and thus volume changes can be recorded (McConaghy, 1967). Other types of gauges have also been developed (see R. C. Rosen & Beck, 1988).

O'Donohue and Letourneau (1992) reviewed the psychometric properties of such gauges in samples of child molesters, although fully realizing that sexual arousal to children is neither a necessary nor a sufficient condition for sexual offending. At the outset we should ask, as these investigators did, what kind of assessment method does PTM represent? Is this a norm-referenced test, where the results for a subject are compared to those for a group of subjects? Is PTM a criterion-referenced test, where we have some criterion of what is healthy or deviant, and we can assess an individual accordingly? Or is PTM not really a test, but a direct observation of behavior? If the latter, then the major criterion is whether this sample of behavior is representative, and the question becomes what we can infer from it. The writers assumed that standard psychometric criteria (i.e., reliability and validity) were indeed relevant.

Reliability. Studies of child molesters usually have few subjects, and the range of PTM scores is often quite restricted. Both of these aspects tend to result in lower reliability coefficients. From a reliability perspective, we would want consistency of response to similar stimuli; in fact, the level of sexual arousal may be affected by a variety of factors, such as time since last orgasm and drug use.

An illustrative reliability study is that of Frenzel and Lang (1989), who studied heterosexual and homosexual sex offenders as well as a control group of heterosexual volunteers. The stimuli were 27 film clips that were shown three times to assess test-retest reliability. The film clips varied as to the gender and age of the person shown, and three items were sexually neutral. Coefficient alpha was .93 for the 27 stimuli. A factor analysis of the data resulted in four factors: general heterosexuality, general homosexuality, adult heterosexuality, and neutral stimulation. This was interpreted by the authors as showing satisfactory internal consistency. A number of the stimuli were repeated; the test-retest correlations for these items ranged from .71 to .86 for female stimuli, with a mean r of .82, and from .21 to .74, with a mean r of .61, for male stimuli.
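Coefficient alpha, reported here and at several other points in this chapter, can be computed directly from a subjects-by-items matrix of scores. Below is a minimal Python sketch of the standard formula; the ratings are invented for illustration and are not Frenzel and Lang's data:

    # Cronbach's alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def cronbach_alpha(matrix):
        # matrix: one row per subject; columns are items/stimuli
        k = len(matrix[0])
        items = [[row[i] for row in matrix] for i in range(k)]
        totals = [sum(row) for row in matrix]
        return (k / (k - 1)) * (1 - sum(variance(col) for col in items) / variance(totals))

    ratings = [[3, 4, 4], [1, 2, 1], [4, 5, 4], [2, 2, 3]]   # 4 subjects x 3 stimuli
    print(round(cronbach_alpha(ratings), 2))   # -> 0.95

The same computation underlies the alpha coefficients cited for the various questionnaires discussed later in the chapter.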
Alternate-forms reliability. The three major types of gauges presumably measure the same variable, penile tumescence, taken as an indicator of sexual arousal. O'Donohue and Letourneau (1992) reviewed four studies and concluded that these different devices do not correlate highly and therefore cannot be considered as alternate forms of the same measurement procedure.

Criterion validity. Validity criteria commonly used are arrest records, criminal convictions, self-report, and the reports of others. All of these criteria are fallible. For example, O'Donohue and Letourneau cite one study where the investigators interviewed 561 nonincarcerated paraphiliacs (sex offenders); they found a ratio of arrests to commissions of rape and child molestation of 1 to 30, indicating a very high rate of false negatives (i.e., people who have not been arrested but have committed a sexual offense).

An illustrative study is that of Freund (1965), who used colored slides of nude males and females of different ages to compare two groups: 20 heterosexual pedophiles and 20 heterosexual patients in an alcohol-abuse program. All subjects showed the greatest sexual arousal to female slides; sex-offender subjects showed the greatest arousal to slides of children and the least arousal to slides of adults. Of the 20 sex offenders, the results misclassified only 1 subject, and of the controls, only 1 subject was misclassified as a pedophile. In general, studies in this category show that child molesters can be differentiated from normal controls, and there is good sensitivity in general, with few false positives and fewer false negatives (Blanchard, Klassen, Dickey, et al., 2001).

Differential diagnosis. One validity question is whether PTM can distinguish sexual offenders from normal controls, and the above suggests that the answer is, "Yes." Another question is that of differential diagnosis. Can the procedure be used
to differentiate among various diagnostic groups – in this case, for example, between child molesters and rapists?

Unfortunately, the answer is complicated by the fact that many paraphiliacs engage in multiple paraphilic acts. O'Donohue and Letourneau (1992) report a study where, in a sample of 561 nonincarcerated paraphiliacs, 20% had offended against both male and female targets, and 42% had offended against individuals in more than one age group. Of those who had been arrested for pedophilia, 33% had also engaged in acts of exhibitionism and 16% in acts of voyeurism. Of those who had been arrested for rape, 44% reported acts of female pedophilia. Thus, there do not seem to be mutually exclusive subgroups of sexual offenders, and it is therefore not surprising that the results of such validity studies present rather mixed findings.

Predictive validity. A few studies have examined the predictive validity of PTM to assess sex-offense relapse. The results are somewhat promising – in one study, 71% of the sample were correctly classified.

Faking. Can PTM be faked? The answer seems to be "Yes." For example, in one study, subjects instructed to pretend to prefer adult women had significantly lower arousal to child stimuli. Other studies report that 80% of child molesters and rapists can control their sexual arousal when asked to do so.

Anxiety. Although there is disagreement about anxiety and how it should be defined, it is generally agreed that anxiety is diffuse (as opposed to fear, which is specific). Therefore, anxiety does not result in a specific behavior (such as in fear of flying, where one avoids airplanes). Anxiety is an emotional reaction that centers on a subjective feeling of distress; there may be physiological arousal (e.g., sweating) and motor disturbances (e.g., stiffness of posture).

Because the subjective feeling of distress is at the core of anxiety, self-report is highly relevant. Basically, there are three types of anxiety scales: (1) those that measure anxiety as a trait – i.e., as a general and stable disposition to become anxious; the trait part of the State-Trait Anxiety Scale discussed in Chapter 7 (as well as the Beck Anxiety Inventory) would be examples; (2) those that measure anxiety as a state – that is, as a response to a particular situation (e.g., how anxious do I feel right now as I'm about to take a final exam); the state part of the State-Trait Anxiety Scale would be an example; and (3) those that measure specific types of anxiety, such as test anxiety or anxiety about public speaking. These inventories could well be classified under the label of "fear."

Depression

Depression can be seen as a behavioral disorder characterized by dysphoric mood, accompanied by motoric changes (e.g., walking slowly) and physiological concomitants such as loss of appetite (see Lewinsohn & Graf, 1973). The Pleasant Events Schedule (PES; MacPhillamy & Lewinsohn, 1976) was developed specifically from a behavioral point of view and reflects Lewinsohn's theory about depression, which essentially sees depression as a lack of positive reinforcement. The PES is intended to assess the amount of external positive reinforcement the individual receives.

Lewinsohn (1976) believes that the general goal for the behavioral treatment of depressed individuals is to restore an adequate schedule of positive reinforcement for the individual. Specific intervention techniques can vary from person to person, and one approach is to use activity schedules to increase the patient's rate of behaviors that are likely to be reinforced by others or are intrinsically reinforcing for that patient.

The PES consists of 320 events that were elicited from various groups of subjects; several forms of the PES were developed, but Form III, with 320 items, seems to be the standard one. The PES is used as a retrospective report of the events of the last 30 days, as well as a daily log for ongoing behavior. Lewinsohn (1976) describes how the PES can be administered to a patient: a subgroup of PES items judged to be most pleasant are targeted, and the patient is asked to indicate at the end of each day which activities he or she has engaged in.

In using the PES as a retrospective report, subjects are asked to indicate how frequently each item occurred within the last 30 days, on a 3-point scale: 0 (not happened); 1 (a few, 1 to 6 times);
and 2 (often, 7 or more times). Subjects then go through the PES a second time and indicate how pleasant each event was, also using a 3-point scale: 0 (not pleasant); 1 (somewhat pleasant); and 2 (very pleasant). From these two sets of ratings, three scores are obtained: Activity Level, which is the sum of the frequency ratings; Reinforcement Potential, which is the sum of the pleasantness ratings; and Obtained Reinforcement, which is the overall sum. Representative PES items are: going to a rock concert, wearing clean clothes, being at the beach, playing golf, and sitting in the sun.
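The three PES scores can be illustrated with a short sketch. The events and ratings below are invented, and note one hedge: the description above calls Obtained Reinforcement "the overall sum," which the sketch computes as the sum of frequency-by-pleasantness cross-products, one common reading of that score:

    # PES-style scoring sketch with invented data.
    # frequency: 0 (not happened), 1 (a few, 1-6 times), 2 (often, 7+ times)
    # pleasantness: 0 (not pleasant), 1 (somewhat pleasant), 2 (very pleasant)
    events = {
        "going to a rock concert": (1, 2),
        "wearing clean clothes":   (2, 1),
        "being at the beach":      (0, 2),
    }

    activity_level = sum(freq for freq, _ in events.values())        # sum of frequency ratings
    reinforcement_potential = sum(pl for _, pl in events.values())   # sum of pleasantness ratings
    obtained_reinforcement = sum(f * p for f, p in events.values())  # frequency x pleasantness

    print(activity_level, reinforcement_potential, obtained_reinforcement)  # -> 3 5 4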
Test-retest reliability for a sample retested over a 4- to 8-week period was .85, .66, and .72 for the three scores; alpha coefficients were .96, .98, and .97 (Hersen & Bellack, 1981).

Evidence for the validity of the instrument has been presented by Lewinsohn and his colleagues in studies where the PES discriminates between depressed individuals and psychiatric or normal controls, and in studies where various psychotherapeutic interventions with depressed patients result in parallel changes in the PES. Most of the validity would then be considered construct validity. For example, Lewinsohn and Graf (1973) found that the number of pleasant activities engaged in was significantly related to mood ratings of depression (more activities, less depression), and depressed subjects engaged in fewer pleasant activities. The 49 PES items that were significantly related to mood seemed to fall into three categories: (1) incompatible affects (e.g., laughing, being relaxed); (2) social interactions (e.g., kissing, being with friends); and (3) ego supportive (e.g., doing a job well, driving skillfully).

Bouman and Luteijn (1986) took the 49 PES items from the above study and used them as a scale in a sample of Dutch psychiatric patients, who also completed other questionnaires. Scores on this PES Mood scale correlated significantly with other questionnaires of depression and of neuroticism; for example, −.58 with the Beck Depression Inventory, and −.46 with the State Anxiety and −.39 with the Trait Anxiety scores of the Spielberger State-Trait Anxiety Inventory. However, when amount of depression was statistically removed, the correlations between the PES and nondepression scores became nonsignificant – i.e., the PES Mood scale is related to depression and not to other variables. When the sample of patients was divided into three diagnostic groups (major depression, nonmajor depression, and no depression), significant mean differences on the PES Mood scale were obtained, with lower scores associated with greater depression.

In keeping with the behaviorist orientation, the various published studies on the PES do not mention reliability or validity. The focus is on the relationship of pleasant events to various behavioral issues, rather than concern with the psychometric aspects of the scale.

Lewinsohn also developed an Unpleasant Events Schedule, a 320-item scale somewhat parallel to the PES, but designed to assess stressful life events (Lewinsohn, Mermelstein, Alexander, et al., 1985).

Hersen and Bellack (1981) indicate that such activity schedules, and the PES in particular, are good examples of behavioral measures that have been subjected to careful psychometric development and analysis. Unfortunately, most of the data that would be of interest from a psychometric point of view (e.g., reliability, results of factor analyses, etc.) are in unpublished manuscripts rather than in the public domain.

Social Skills

One of the major areas of deficit studied by behavior modification has been that of social skills, generally involving the ability to express both positive and negative feelings appropriately in interpersonal situations. The label "social skills" stands for a variety of behaviors that usually have an interpersonal component (i.e., relating to others) and an evaluative component (i.e., how effective the person is in relation to others). From a behavioral-assessment stance, we might define social skills as the abilities to emit behaviors that are positively reinforced by others and not to emit behaviors that are punished by others. Among these skills, assertiveness and heterosexual dating skills have been the focus of much research and many self-report scales (for an early review of several self-report techniques used in the social-skills area, see Hersen & Bellack, 1977).

Assertiveness

Assertiveness involves the direct expression of one's feelings, preferences, needs, and opinions. Assertiveness is thus positively related
to social dominance and negatively related to "abasement." It is different from aggression, in that aggression involves threats or punishment. Assertiveness seems to be a major problem for many people and, especially from a behavioral-assessment point of view, there is a need for reliable and valid instruments for use with adult samples; a number have been proposed (e.g., McFall & Lillesand, 1971; Rathus, 1973). Most assertiveness inventories ask subjects to indicate whether or not they engage in specific behaviors. Thus the questions are typically quite specific and are therefore less likely to be distorted by general attitudes and more likely to be directly related to the outside criterion.

The Rathus Assertiveness Schedule (RAS)

Rathus (1973) felt the need for an instrument to measure behavioral change in assertion-training programs, and presented a 30-item Assertiveness Schedule, which subsequently became quite popular in the assertiveness literature.

The subject is asked to rate each of the 30 items on a 6-point scale, ranging from "very characteristic of me, extremely descriptive" to "very uncharacteristic of me, extremely undescriptive." The items cover such issues as whether the person complains in a restaurant if the service is poor or the food is not prepared satisfactorily, whether the person has difficulties saying no, and whether he or she finds it embarrassing to return merchandise. The items were obtained from other scales, as well as from students' diaries.

Test-retest reliability for a sample of 68 college students, with a time interval of 8 weeks, was .78. Odd-even reliability for a sample of 67 subjects was .77. Split-half reliability coefficients reported in the literature generally range from .73 to .91 for various subgroups of patients.

Validity was originally established by comparing RAS scores to semantic differential ratings given by acquaintances. Five of the ratings covered an assertiveness factor, and these ratings correlated with RAS scores as follows: boldness .61; outspokenness .62; assertiveness .34; aggressiveness .54; confidence .33.

Semantic differential items such as smart-dumb and happy-unhappy did not correlate significantly, indicating that RAS scores are not confounded by social desirability.

In a second study, with 47 college coeds, RAS scores were compared with audiotaped responses to five vignettes that could be resolved by assertive behavior. These responses were rated blindly by judges, who showed high interrater reliability (r = .94). RAS scores and ratings from these audiotaped responses correlated .70.

An item analysis indicated that 27 of the items correlated significantly with the total RAS score, and 19 of the 30 items correlated significantly with the semantic differential items indicating assertiveness. However, Rathus (1973) suggested retaining all 30 items because they provide useful information.

Concurrent validity of the RAS is high. For example, in one study (Rathus & Nevid, 1977), scores for a sample of psychiatric patients correlated .80 with therapists' ratings of assertiveness using a semantic differential scale. Discriminant validity also is good. For example, the RAS differentiates between assertive and nonassertive male and female students (Nietzel, Martorano, & Melnick, 1977).

Adult Self-Expression Scale (ASES)

Gay, Hollandsworth, and Galassi (1975) set out to develop an assertiveness scale. They first developed a blueprint to guide their efforts by describing assertiveness along two dimensions. One dimension specified interpersonal situations in which assertive behavior might occur, such as interactions with parents or with friends; six such situations were specified. The second dimension specified assertive behaviors that might occur in these interpersonal situations, such as refusing unreasonable requests or expressing negative feelings; seven such behaviors were specified. This 6 × 7 format became a specification table for item selection.
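A specification table of this kind is simply a situations-by-behaviors grid used to track item coverage during test construction. The sketch below illustrates the idea; the category labels and item assignments are hypothetical, not the authors' actual categories (in the real ASES, the 48 retained items covered 40 of the 42 cells):

    # Sketch of a 6 x 7 specification table for item selection (labels hypothetical).
    situations = ["parents", "friends", "strangers", "authority", "intimates", "service"]
    behaviors = ["refuse request", "express negative", "express positive",
                 "ask a favor", "initiate contact", "state opinion", "stand up for rights"]

    # Map each candidate item number to the (situation, behavior) cell it was written for.
    items = {1: ("parents", "refuse request"), 2: ("friends", "express negative")}

    covered = set(items.values())
    total_cells = len(situations) * len(behaviors)   # 42 cells
    print(f"{len(covered)} of {total_cells} cells covered")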
A pool of 106 items, most coming from other scales, was administered to 194 adult subjects attending a community college. Items were then retained on the basis of item analyses, which looked at item discrimination and item-total correlations. Forty-eight items were retained that covered 40 of the 42 cells in the specification table.

The ASES uses a 5-point Likert response format, with 25 of the items positively worded and 23 negatively worded. Test-retest reliability coefficients for 2-week and 5-week intervals are .88 and .91. In another study, test-retest reliability
with a 1-week interval ranged from .81 to .89, and odd-even reliability was .92 (Hollandsworth, Galassi, & Gay, 1977).

Construct validity was established by correlating scale scores with ACL scores and with other measures of anxiety and of locus of control. The pattern of results supported the construct validity of the scale. For example, high scorers described themselves more favorably, as more self-confident and spontaneous, more achievement oriented, and more often seeking leadership roles. A factor analysis suggested 14 factors, which the authors interpreted as supporting the two-dimensional descriptive model used in the construction of the scale.

Hollandsworth, Galassi, and Gay (1977) reported two validational studies of the ASES using a multitrait-multimethod procedure. In the first study, three samples (adults attending a technical institute, students in a graduate course in counseling, and psychiatric inpatients) were administered the ASES, the ACL, and an observer-rating scale on assertiveness. In the second study, convicted male felons were administered the ASES, the Rathus Assertiveness Schedule, and a self-report of aggressiveness; their criminal records were also rated for previous assaultive behavior.

Scores on the ASES correlated positively with dominance (.35 to .68) and negatively with abasement (−.46 to −.72) across all samples. Scores on the ASES also correlated significantly with other measures of assertiveness; thus, convergent validity using the same method of assessment was supported. However, when convergent validity as assessed by different methods was analyzed, the results for the relationship between assertiveness and abasement supported the hypothesis, but the results for the relationship between assertiveness and dominance were significant for only one sample. Finally, the analysis of discriminant validity only partially supported the hypotheses. In general, however, the authors concluded that the results do support the utility of the ASES as a measure of assertive behavior.

Children's Assertive Behavior Scale (CABS)

There are many self-report measures of assertiveness for use with adult subjects, but relatively few for children. Michelson and Wood (1982) developed the CABS, designed to assess not only children's assertive behavior but also their passive and aggressive modes of responding. The authors first identified various content areas of assertive behavior. Items were then developed for each content area and were examined by graduate students for their content validity. Other judges were then asked to assess the validity of the "continuum of responses" for each item (e.g., "Is choice D more aggressive than choice B on the sample item below?"). Mean percentage of agreement was 94%.

The CABS is composed of 27 multiple-choice items, where the stem sets up a situation, followed by five response options. For example: "You are working on a project and someone says 'What are you doing?' Usually you would reply:"

−1 (a) Nothing really.
+1 (b) Don't bother me. Can't you see I'm busy?
−2 (c) Continue working and ignore the person.
+2 (d) It's none of your business.
 0 (e) You would stop and explain what you were doing.

The scoring weights given above would, of course, not appear on the test protocol. Of the five responses, one is passive (a), one is very passive (c), one is assertive (e), one is aggressive (b), and one is very aggressive (d). Each of the response options is given a scoring weight, and three scores are generated: a passive score, which is the sum of the minus answers; an aggressive score, which is the sum of the plus answers; and a total score, which is the sum of the absolute values of the passive and aggressive scores.
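The CABS scoring rules are compact enough to state in a few lines of code. The following is a minimal sketch under the reading just given (weights as shown in the sample item; the response pattern is invented):

    # CABS-style scoring sketch. Weights follow the sample item above:
    # -2 very passive, -1 passive, 0 assertive, +1 aggressive, +2 very aggressive.
    # chosen_weights: the weight of the option the child picked on each of the 27 items.
    def score_cabs(chosen_weights):
        passive = sum(w for w in chosen_weights if w < 0)      # sum of minus answers
        aggressive = sum(w for w in chosen_weights if w > 0)   # sum of plus answers
        total = abs(passive) + aggressive                      # overall departure from assertiveness
        return passive, aggressive, total

    print(score_cabs([-1, 0, 2, -2, 0, 1]))   # -> (-3, 3, 6)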
Significant concurrent validity was obtained with ratings of social competence by peers, parents, and teachers. The CABS also discriminated between children who had participated in social-skills training and those who had not.

Hobbs and Walle (1985) asked a sample of primary-school children to complete some sociometric procedures in which they indicated their three best friends, as well as three children whom they would most wish to be like (i.e., admired). The children also completed the CABS, which was scored for passivity, aggressiveness, and assertiveness. Children who scored high on positive peer nominations responded less aggressively than children who received low scores on
peer nominations. Black children showed greater aggressive responding than white children, and boys were more aggressive but less assertive than girls. The findings were significant only for the measure of aggression and not for the measure of passivity, but the authors concluded that the findings supported the validity of the CABS.

Sensitivity. One major line of evidence for the validity of such assertiveness questionnaires is their sensitivity to changes in behavior as a function of therapeutic interventions. Such studies usually use a pre-post format, where the questionnaire is administered to a sample of subjects, typically clients seeking help; the clients are then given behavior therapy and are reassessed. In general, such studies support the validity of assertiveness scales.

Other studies look at the concurrent validity of such scales by comparing scale scores with scores on other questionnaires, or look at discriminant validity by comparing assertive vs. nonassertive individuals. Still other studies have looked at the factor structure of such questionnaires.

Criticism. One of the problems with many assertiveness scales is that they confound assertiveness and aggressiveness; i.e., scores on the assertiveness scale correlate not only with indices of assertiveness, but also with indices of aggression (e.g., DeGiovanni & Epstein, 1978).

Phobias

It is estimated that in the United States more than 15 million individuals are affected by phobias at any given time, which makes phobias the most common "mental disorder." Thus a number of scales have been developed to measure such phobias.

The fear thermometer. One of the first self-report measures of fear was developed in the context of evaluating parachute jumpers (Walk, 1956). This was called the fear thermometer: a thermometer-like figure was divided into 10 equal parts, and the trainee was asked to indicate his fear of jumping from a 34-foot mock tower by placing a mark on the fear thermometer, with the height of the mark indicating the degree of fear experienced.

The Fear Survey Schedule (FSS)

There are various forms of the FSS, ranging in length from 50 to 122 items. The first form, FSS1, was a list of 50 common fears to be rated by subjects on a 7-point scale (Lang & Lazovik, 1963). The items were developed on the basis of subjective judgment (i.e., the author chose the items), as part of a doctoral dissertation (Akutagawa, 1956). The focus was a clinical one; the FSS was designed to provide the therapist with a brief overview of the fears a client might have. FSS2 was constructed empirically for research purposes by selecting 51 items from open-ended questionnaires, with considerable psychometric data (Geer, 1965). FSS3 was developed within a clinical setting and consisted of a larger number of items (the literature cites from 72 to 108) to be used for clinical purposes (Wolpe & Lang, 1964; 1977). Other forms, representing different combinations of items, have also been presented in the literature (e.g., Braun & Reynolds, 1969; Lawlis, 1971; Tasto & Hickson, 1970; Wolpe, 1973).

The FSS has been used in many studies, but because different forms that have different degrees of overlap have been used (e.g., FSS2 and FSS3 overlap by 20 items), it is difficult to generalize the results.

Different FSS forms have been factor analyzed and, not surprisingly, different numbers of factors have been obtained – sometimes as few as 3 and sometimes as many as 21. Sometimes the factors were replicated on a second sample and sometimes they were not. Sometimes different factors emerged for males than for females. Many studies have used the ubiquitous college students taking introductory psychology, but some have used psychiatric samples, with different results. Tasto (1977) suggested that if we consider only studies of college students and go beyond the different factor labels that different authors use, there is a fair amount of agreement across studies, with four major factors present: (1) fears related to small animals; (2) fears related to death, physical pain, and surgery; (3) fears about aggression; and (4) fears of interpersonal events.

Reliability. Hersen (1973) pointed out that few studies have looked at reliability, but those that have, have reported substantial reliability.
Internal-consistency estimates, where reported, seem to be high – typically in the low to mid .90s. Test-retest reliability is lower but still acceptable; typical coefficients reported are in the mid .70s to mid .80s, with test-retest intervals of 1 to 3 months.

Validity. A behavioral-assessment procedure such as the FSS was developed because other instruments, for example the MMPI, were seen as not adequate from either a theoretical point of view or an applied point of view. To then turn around and correlate the FSS with such measures would seem somewhat strange. If low correlations were obtained, one could argue that the instrument lacked construct validity, even though construct validity does not seem as germane to behavioral-assessment techniques as it does to more traditional tests. If high correlations were obtained, then one could argue that the new measure was superfluous. In fact, studies that have looked at such correlations find coefficients in the mid-range – for example, correlations in the .40s with measures of anxiety in samples of therapy clients and/or college students.

From a behavioral point of view, validity is best established by comparing verbal reports to overt behavior and/or physiological indices. The fear thermometer mentioned above showed that parachute trainees who passed the course rated themselves as less fearful than those who did not. Within those who passed, lower fear ratings were related to achieving correct jump techniques earlier in training. Fazio (1969) validated a fear survey against the criterion of actually picking up a cockroach.

Geer (1965) conducted a number of validational studies of the FSS. In one, subjects were required to approach a German Shepherd. For females, scores on the FSS were related to behavioral indices such as distance and latency (i.e., how close and how fast they approached the dog), but not for males. In another study, latency was longer for low-fear subjects than for high-fear subjects, contrary to what one would predict. In general, many of the results reported in the literature suggest that great caution needs to be exercised, as there may well be confounding factors present when one attempts to validate such measures.

Because scales such as the FSS are used in the context of behavioral treatment, there are a number of studies in which the FSS is used with experimental and control groups and a pre-post treatment design. If one is interested in the efficacy of the treatment, we can assume that the FSS is valid, and we can use significant pre-post changes in scores in the experimental group as evidence that the treatment works. We can also consider such sensitivity as evidence that the FSS is valid (see Tasto, 1977, for a discussion of such studies).

A number of factor-analytic studies have been done, with the results reflecting the composition (i.e., the number and type of items) of the particular scale being analyzed. A number of authors have suggested that factor scores be used rather than subtotal scores of discrete items.

One of the major problems, especially with FSS3, is that it generates a large number of false positives. Klieger and McCoy (1994) developed a modified version of the FSS3 by decreasing the ambiguity of the items, including specific anchors for each item, and including instructions that separated emotions such as disgust from fear. They then administered the modified FSS3 to undergraduate students. To evaluate the scale, a behavioral-avoidance task (or BAT) was used. For example, one such BAT consisted of a harmless snake housed in a terrarium. The student walked toward the terrarium and stopped when he or she felt anxious. The floor was marked in 1-foot increments, so the distance traveled could easily be determined. As an example of the results, responses on the modified FSS3 correlated .33 for females and .64 for males with scores on the BAT that involved approaching the closed terrarium, removing the lid, and touching an upright log. In an earlier study (Klieger & Franklin, 1993), no significant differences on BATs were obtained between high-fear and no-fear students. For example, all 20 no-fear students approached the snake terrarium, as did 17 of the 20 high-fear students. A number of studies have indeed found low correlations (e.g., in the .20s and .30s) between FSS scores and various behavioral indices. Thus, cognitive and behavioral indices of fear may be largely independent of each other, and it may well be that the FSS is not an effective classificatory tool. There may also be a number of confounding aspects, such as the demand characteristics of the experimental situation.
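Validity coefficients of this kind are ordinary Pearson correlations between questionnaire scores and a behavioral index, here feet traveled on the BAT. A minimal sketch with invented data (the function and variable names are ours, not from the studies cited):

    # Pearson r between FSS-type fear scores and BAT approach distance (invented data).
    from math import sqrt

    def pearson_r(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    fss_scores = [12, 30, 25, 8, 40, 18]     # higher = more self-reported fear
    bat_distance = [10, 4, 6, 12, 2, 8]      # feet approached toward the snake

    print(round(pearson_r(fss_scores, bat_distance), 2))  # strongly negative for these data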
Gender differences. One of the consistent findings is that of gender differences, with females reporting higher degrees of fear than males. One argument to account for this difference is that social desirability inhibits males from admitting to fears, whereas for females the social expectation is one of greater fearfulness.

Normative data. A variety of studies report normative data, typically in the form of means, and often separately by gender, because women rate themselves as more fearful than men. In addition, z-score equivalents are also available (Tasto & Hickson, 1970), as well as an MMPI-like profile sheet that allows a person's scores on five FSS factors to be plotted using T scores (Tasto, Hickson, & Rubin, 1971).
surgery” and “hospitals.”
The Fear Questionnaire (FQ)
No gender differences were obtained on these
Another popular measure is the FQ, a brief 15- three factors and modest correlations with vari-
item instrument that requires the subject to rate ous measures of anxiety. The three-factor scales
his or her avoidance of specific situations (I. M. were found to be correlated with each other, con-
Marks & Mathews, 1979). Typical items are: sight trary to the findings of other studies that the
of blood, going into crowded shops, being criti- three factors are independent. Quite clearly, there
cized. Three subscales are available blood/injury, may be significant cultural differences. Another
agoraphobia, and social, as well as a total score. obvious source of variation may be the nature
This scale is probably one of the most frequently of the sample; healthy college students are differ-
used self-report instruments in the treatment of ent from neurotic patients seen in a clinic, and it
agoraphobia (Trull, Neitzel, & Main, 1988). In is therefore not surprising that different studies
clinical samples, the FQ has excellent test-retest yield different results.
reliability, normative data is available (Mizes &
Crawford, 1986; Trull & Hillerbrand, 1990), and
PROGRAM EVALUATION
the scale has been translated into a number of
languages including French, Italian, and Spanish Tests can also be used to evaluate programs rather
(H. B. Lee & Oei, 1994). than individuals. There are many ways to do
Trull and Hillerbrand (1990) studied a sam- this. We could, for example, ask individuals to
ple of college students and a volunteer sample complete rating scales that focus on a particular
of community adults (surveyed by phone). For program. For example, if you have ever evalu-
the collegiate sample, females scored higher than ated a course by filling out a questionnaire, you
males on the blood/injury and agoraphobia sub- were basically participating in a program evalu-
scales, as well as the total phobia score. For the ation where you were one of the evaluators. We
community sample, females scored higher on all could also administer questionnaires to individ-
three subscales, as well as the total score. Com- uals, typically in a pre- versus post-design to eval-
munity subjects scored higher than college stu- uate the effectiveness of a program. If we wish to
dents on the social and agoraphobia subscales, evaluate the effectiveness of a therapeutic pro-
as well as on the total score. In both samples, gram, we can assess the mental health of the
the three subscales intercorrelated significantly, patients before entering the program, and then
with coefficients from .28 to .48. Internal consis- again some 6 months later. Here the focus would
tency (Cronbach’s alpha) ranged from .44 to .73. not be on how psychologically healthy the person
A confirmatory factor analysis indicated that the is at post-test, but rather on the changes that have
occurred, presumably reflective of the effectiveness of the program. Sometimes tests are misused in this way: the teaching effectiveness of a school and its staff is often evaluated based on how the children do on a standardized achievement battery, without considering the fact that the scores on these tests may be more highly related to the socioeconomic background of the children than to the expertise and efforts of the teachers.

National programs such as Head Start have had massive evaluations, although the conclusions derived from such evaluations do not necessarily reflect unanimous agreement (see W. L. Goodwin & Driscoll, 1980).

Program evaluations can vary from rather simple approaches, where a single rating scale is used, as in the typical course evaluation conducted on collegiate campuses, to complex multiphasic approaches that use cognitive measures, assessment of psychomotor changes, economic indicators, direct observation, use of archival data, and so on. Thus, in evaluating a graduate program in clinical psychology, the evaluating team looks not only at information such as the publications of the faculty, their rank, their teaching evaluations, etc., but also interviews graduate students, deans, and other administrators, and obtains architectural information about the amount of space available, library resources, and so on.

An excellent example and case study is the evaluation of Sesame Street, an extremely popular children's television program geared for children ages 3 to 5. The program was originally developed as a remedial program, especially for disadvantaged children, to prepare them for school. The intent was not only to entertain but to increase intellectual and cultural development (Ball & Bogatz, 1970). The first year of Sesame Street was evaluated through a large sample of children that included disadvantaged inner-city children, Spanish-speaking children, and rural-area children. A special battery of tests was developed for this program evaluation, basically consisting of 12 major tests (such as Numbers and Sorting Skills) that took an average of 2 hours to administer. The alpha coefficients for the subtests ranged from .17 to .93, with a median of .66 (see W. L. Goodwin & Driscoll, 1980, for a synopsis). To the extent that psychological tests are used in such program evaluations, we need to be mindful of the psychometric requirements of the instruments used.

ASSESSMENT OF ENVIRONMENTS

Measuring the Home Environment

You recall that Public Law 91-230, the Handicapped Children's Act (Chapter 12), mandated the establishment of preschool and early-childhood programs for disabled and high-risk children. Thus, these children need to be identified, and there is a need for instruments. Traditional intelligence tests such as the Wechsler or the Stanford-Binet can be quite useful, but "at risk" involves more than simply cognitive deficit. There is a need to identify environmental aspects that can either directly affect development or can interact with other variables, such as intelligence, to produce greater developmental delay. Scientists often use socioeconomic status (SES) as an index of environmental quality. Although SES works and is used widely, it is also a fairly limited index. Within an SES grouping, for example, families differ widely in the kinds and amounts of stimulation they provide their children.

Caldwell and her colleagues (e.g., Bradley & Caldwell, 1977, 1979) pointed out that there is a need to develop screening and diagnostic procedures that identify children at risk. A child's developmental environment can be hypothesized to play a major role, and thus should be included as a major assessment dimension. In particular, a child's developmental environment can be indexed by process measures – that is, the day-to-day, moment-to-moment transactions of the child with the environment: the type of language stimulation received, the kind of play materials available, the types of punishment and rewards, etc. They therefore developed the Home Observation for Measurement of the Environment (HOME) Inventory, which consists of items (different numbers in different versions) that cover such aspects as the emotional and verbal responsivity of the mother to the child, whether there are appropriate play materials, and so on. The HOME Inventory is administered in the home by a trained observer and requires about 60 minutes.

Bradley and Caldwell (1979) presented validity data for the preschool version of the HOME
Inventory. This version was designed to assess the quantity and quality of the social, emotional, and cognitive support available for children aged 3 to 6 within their home. Originally, all the items developed for the HOME Inventory were based on direct observation of actual transactions between mother and child. As the inventory was refined, however, additional items based on interview data were added. Several forms were produced, including a 144-item version, an 80-item version, and a 55-item version, with items scored as present or absent.

The 80-item version was administered to 232 volunteer families, including both black and white participants. A factor analysis indicated 7 factors, and items that loaded significantly on one of these factors were retained, as well as 9 items that correlated with academic achievement, yielding a 55-item scale. Some of the factors were: (1) stimulation through toys, games, and reading materials; (2) language stimulation; and (3) stimulation of academic behavior. An item analysis indicated that item correlations with total score or total subscale score were moderate to high (.20 to .70). In most instances, the correlation between an item and its subscale score was greater than the correlation between an item and the total scale score.

Internal-consistency coefficients (KR-20) for the subscales ranged from .53 to .83. Test-retest stability over 18 months ranged from .05 to .70. However, these are not really reliability coefficients, but may in fact reflect the kind of changes that occur in the home over such a time span.

Validity was assessed by correlating HOME subscale scores with indices of socioeconomic status (such as maternal and paternal education and occupation) and with Stanford-Binet IQ. In general, correlations with mother's or father's occupation were negligible, but correlations with mother's and father's education, as well as with the amount of crowding in the home, were significant (the more crowded the home, the lower the child's score). HOME scores were substantially related to Stanford-Binet scores.

A discriminant analysis of the HOME Inventory in relation to Stanford-Binet scores indicated that low IQ (below 70) was associated with poorer organization of the environment, fewer appropriate play materials available, and less maternal involvement with the child. Table 18-2 presents the results for a sample of 91 families, where the families were divided into low IQ (IQ below 70), low average (70 to 89), and average to superior (90 and above). Note that the sensitivity of the HOME Inventory for the low IQ group is high, with 71% (12 of 17) correctly identified. There is less sensitivity for the low average group, where 14 of 34, or 41%, are correctly identified. For the average to superior group, 25 of 40, or 62.5%, were correctly identified. Specificity within the low IQ group is less satisfactory: the number of homes incorrectly identified as being associated with low IQ (i.e., false positives) is high – the procedure identifies 28 homes as low IQ, and 16 of these are incorrect. By comparison, the specificity associated with average to superior IQ is good – 25 of 40 were correctly identified, so 15 were false negatives. When screening for retardation, it can be argued that sensitivity is more important than specificity – that is, we really want to identify those children who do need special and/or remediation programs.

Table 18-2. Use of HOME Inventory in a Sample of 91 Families

                            Predicted group membership based on
                            discriminant function of HOME Inventory
    Actual IQ group             Low    Low average    Average to superior
    Low                          12         5                  0
    Low average                   9        14                 11
    Average to superior           7         8                 25
tion between an item and its subscale score was remediation programs.
greater than the correlation between an item and Bradley, Caldwell, and Elardo (1977) studied
the total scale score. the ability of the HOME inventory and SES to
Internal consistency coefficients (KR 20) for predict IQ, and found that across both white
the subscales ranged from .53 to .83. Test-retest and black children, as well as male and female,
stability over 18 months ranged from .05 to .70. the HOME inventory was a more accurate index
However, these are not really reliability coeffi- of environmental quality, and predicted IQ bet-
cients, but may in fact reflect the kind of changes ter than did SES, with correlation coefficients
that occur in the home over such a time span. between .58 to .79.
Validity was assessed by correlating HOME
subscale scores with indices of socioeconomic
Measuring Classroom Environments
status (such as maternal and paternal educa-
tion, and occupation), and Stanford-Binet IQ. It is generally believed that the classroom, espe-
In general, correlations with mother’s or father’s cially in the primary grades and the high-school
occupation was negligible, but correlations with years is a critical setting not only for educa-
mother’s and father’s education, as well as with tional development but also for psychosocial and
interpersonal development. It is also believed that there are individual differences between classrooms, and that classrooms do have distinct climates or atmospheres. Thus, there has been a lot of research on classroom climate.

The Classroom Environment Scale (CES)

Moos and his colleagues (e.g., Moos, 1968; 1972; Moos & Houts, 1968; Trickett & Moos, 1970) have devoted considerable effort to the measurement of environments in a variety of settings, such as university dormitories, psychiatric wards, and high-school classrooms. Trickett and Moos (1973) developed a Classroom Environment Scale (CES) to assess the psychosocial environment of junior high- and high-school classrooms. They began by analyzing the literature on the topic and, based on the theoretical notions found in the literature, decided there were three major sets of variables: (1) interpersonal relationship variables, which include affective aspects of student-student and teacher-student interactions; (2) structural aspects of the classroom; and (3) goal-orientation variables.

Given this theoretical framework, the authors then searched the educational literature, conducted structured interviews with both teachers and students, and observed a wide range of classes. This led to the identification of conceptual dimensions (e.g., student involvement) and to the writing of test items presumably indicative of each conceptual dimension. Note again the close correspondence between the steps taken and those outlined in Chapter 2. Although having a strong basement doesn't mean that the rest of the structure will necessarily be solid, a strong foundation is a necessary first step.

Once the original pool of items was constructed, items were independently evaluated as to conceptual dimension by two raters. If the raters did not agree, the item was discarded. The initial version of the CES consisted of 242 items representing 13 conceptual dimensions. On each dimension, approximately half of the items were keyed true and half were keyed false. For each item (e.g., "The teacher is very strict"), the respondent answers true or false depending on whether the statement is seen as a general characteristic of the classroom.

This initial version was given to 504 students in 26 different classrooms in seven high schools, along with the Marlowe-Crowne Social Desirability Scale (see Chapter 16). All of the items showed very low correlations (.20 or less) with scores on the Marlowe-Crowne, but item analyses showed that some items either did not correlate with dimension scores or were too highly correlated with other items. After a number of statistical analyses were carried out, and additional observations made, a 208-item form was developed and administered to 443 students in 22 different classrooms.

Most of the items (89%) discriminated between classrooms, and most of the items correlated highly with their subscale scores. This and other statistical considerations led to the development of a 90-item form that covers nine dimensions (such as affiliation, competition, and teacher control) with 10 items per dimension. The internal consistency (Cronbach's alpha) for these nine dimensions ranged from .67 to .86, showing acceptable internal consistency. All 9 scales differentiated the 22 classrooms, with most of the scales having relatively low intercorrelations with each other (the average was .26). Two of the classrooms were retested after a 2-week period, with test-retest correlations ranging from .91 to .98. The CES, then, provides something like a personality profile, not of a person but of a classroom, as perceived by the respondents.
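Since the CES profile describes a classroom rather than a person, each dimension score is an aggregate of the responses of the students in that classroom. The sketch below illustrates one plausible aggregation (the simple mean); the dimension names and numbers are only illustrative, and the averaging rule is our assumption, since the authors' exact scoring procedure is not reproduced here:

    # Build a classroom "profile" by averaging student scores on each dimension.
    students = [
        {"affiliation": 8, "competition": 4, "teacher_control": 6},
        {"affiliation": 7, "competition": 5, "teacher_control": 7},
        {"affiliation": 9, "competition": 3, "teacher_control": 6},
    ]

    profile = {dim: sum(s[dim] for s in students) / len(students)
               for dim in students[0]}
    print(profile)   # e.g., {'affiliation': 8.0, 'competition': 4.0, 'teacher_control': 6.33}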
The Individualized Classroom Environment Questionnaire (ICEQ)

In the 1980s, education within Great Britain underwent a number of dramatic changes that created and accompanied a shift in thinking – from a traditional approach that explained scholastic success and failure as located "within the child" to explanations that concentrate on evaluating the total context in which learning is expected to occur – that is, an evaluation of the classroom environment.

Fraser (Burden & Fraser, 1993; Rentoul & Fraser, 1979) developed the Individualized Classroom Environment Questionnaire in Australia and cross-validated the questionnaire in British schools. There are two forms of the questionnaire (presumably using the same items): one is the actual form, where the child rates the actual classroom, and the other is a "preferred" form, where the child indicates what he or she would prefer. The forms can also be completed by the teacher, so
a total of four combinations can be obtained. Representative items are "the teacher talks with each student" and "there is classroom discussion." Each item is responded to on a Likert-like scale of 1 (almost never) to 5 (very often). The 25 items are divided into five subscales: Personalization, Participation, Independence, Investigation, and Differentiation (i.e., whether different students use different materials or engage in different projects).

Burden and Fraser (1993) present alpha reliability coefficients for children in eight classrooms. These coefficients range from .62 to .85, with 6 out of the 10 coefficients above .70. In a study of 34 classrooms in Australia, alpha coefficients ranged from .61 to .90, indicating satisfactory reliability. The scales themselves intercorrelate from .09 to .36, suggesting adequate discriminant validity. Test-retest reliability is from .67 to .86. Several studies of predictive and criterion validity support the initial validity of the ICEQ (Fraser, 1981). A comparison of the ICEQ with scores on a locus-of-control scale indicated small but significant correlations with three of the five scales. The authors also give a case study where the actual and preferred forms were administered in one science class, with appreciable differences between the two. The teacher then made a number of changes based on the students' comments, and 4 weeks later the forms were readministered, showing substantial changes.
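The actual-versus-preferred comparison in that case study amounts to computing, for each of the five scales, the discrepancy between the two forms. A sketch with invented class means (the ICEQ's item-level scoring is not reproduced here):

    # Preferred-minus-actual discrepancies on the five ICEQ scales (invented means).
    actual = {"Personalization": 2.1, "Participation": 3.0, "Independence": 2.4,
              "Investigation": 2.8, "Differentiation": 1.9}
    preferred = {"Personalization": 4.2, "Participation": 3.9, "Independence": 3.5,
                 "Investigation": 3.1, "Differentiation": 3.8}

    discrepancy = {scale: round(preferred[scale] - actual[scale], 1) for scale in actual}
    print(discrepancy)   # large gaps flag aspects of the classroom students want changed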
Classroom environments have been assessed in terms of student or teacher perceptions of psychosocial dimensions such as cohesiveness and degree of competition; see Moos (1979) for an introduction to the assessment of educational environments. A number of popular instruments in this area exist in addition to the Classroom Environment Scale (Moos & Trickett, 1974), such as the Learning Environment Inventory (G. J. Anderson & Walberg, 1976). In general, when classroom environment perception is used as an independent variable, there is considerable validity in that this variable (and its components) accounts for an appreciable amount of the variance in cognitive and affective outcomes (such as degree of learning and student satisfaction), often more than what can be attributed to student characteristics, such as their general ability (Fraser, 1981). When used as a dependent variable, classroom environment perception also shows validity in that such perception shows differences between classrooms that differ in class size, type of school, and so on.

College environments. Pace and Stern (1958) began with the theoretical framework proposed by Henry Murray, whose personality theory was based on the notion of personal needs (motives, goals, etc.) as well as environmental press, i.e., aspects of the environment that impinge upon human behavior. They constructed the College Characteristics Index, composed of 300 true-false items organized into 30 ten-item subscales. The scales basically reflect needs, such as need affiliation, but these are presses (statements about the college environment) – in this case, the aspects of the college environment that would promote affiliation. Respondents answer true or false depending on whether they believe the item is or is not characteristic of this particular college environment (e.g., "Spontaneous student demonstrations occur frequently" might reflect the press of impulsiveness).

The initial instrument was given to a sample of 423 college students and 71 faculty members at five different collegiate institutions. Differences between student and faculty responses were minor, and rank-order correlations across the 30 presses for students vs. faculty were .88 and .96, showing substantial agreement. Case studies of different institutions are presented to support the validity of the instrument (which was subsequently revised).

Thistlethwaite (1960) took a revised form of the College Characteristics Index and studied a sample of 1,500 National Merit Scholars, college students who had done extremely well on the SAT. He found that the stability of students' study plans was related to such aspects as role models for imitation. Decisions to seek advanced degrees were often based on the perception of faculty as enthusiastic and informal and on their stressing achievement and independence. (For a different approach to the measurement of college environments, see Astin & Holland, 1961.)

Measuring the Physical Environment

Rating the taste of water. From a psychological-testing perspective, the assessment of the psychological components of a physical environment is of greater interest than assessing the physical components. Take water, for example. We
could measure its mineral content, degree of pollution, viscosity, etc., but we usually leave such measurements to physical scientists. The psychological perception of water, however, is another matter. Bruvold (1968) presented four rating scales to evaluate the taste of water. These scales were developed using the Thurstone method of equal-appearing intervals, and for each of the scales nine items were retained. The first scale, called the "hedonic" scale, consists of nine items that range from "I like this water extremely" to "I dislike this water extremely." The second scale, the "quality" scale, consists of nine items that range from "This water has an excellent taste" to "This water has a horrible taste." The third scale is the "action tendency" scale, and its nine items range from "I would be very happy to accept this water as my everyday drinking water" to "I can't stand this water in my mouth and I could never drink it." Finally, the fourth scale, called the "combination" scale, has nine items that range from "This water tastes real good. I would be very happy to have it for my everyday drinking water" to "This water has a terrible taste. I would never drink it."

Reliability of these scales was assessed in a laboratory study where 24 employees of the California State Department of Public Health rated 10 water samples, with the order of water-sample presentation randomized, as well as a number of other experimental controls. For each subject, 28 reliability correlation coefficients were computed, and the average intercorrelations between and within scales ranged from .62 to .77. The validity of the scales was assessed by comparing mean ratings for samples of respondents from different towns, and by showing an inverse relationship between favorability toward water and the mineral content of the water.
favorability toward water and mineral content of like a bright sunny day” vs. “I like collecting pine
the water. cones” could easily be items from an MMPI- or
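To make the construction logic concrete, here is a minimal sketch, in Python, of how an equal-appearing-intervals scale of this kind is typically built and scored; the statements and judge ratings below are invented stand-ins, not Bruvold’s actual items or scale values. Under the Thurstone method, judges sort candidate statements into ordered categories, an item’s scale value is the median judge category, and a respondent’s score is the median scale value of the statements he or she endorses.

    # Minimal sketch of Thurstone equal-appearing-intervals scoring.
    # Statements and judge ratings are hypothetical, not Bruvold's (1968).
    from statistics import median

    judge_sorts = {  # each item: the 1-9 categories assigned by five judges
        "I like this water extremely": [9, 9, 8, 9, 8],
        "This water is acceptable": [5, 6, 5, 5, 6],
        "I dislike this water extremely": [1, 1, 2, 1, 1],
    }
    scale_values = {item: median(cats) for item, cats in judge_sorts.items()}

    def respondent_score(endorsed):
        # A respondent's score: median scale value of the endorsed statements.
        return median(scale_values[item] for item in endorsed)

    print(scale_values)
    print(respondent_score(["This water is acceptable",
                            "I like this water extremely"]))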
Environmental preferences. People differ in their preferences for different environmental settings. The Environmental Preference Questionnaire (EPQ; R. Kaplan, 1977) was designed to explore individual differences in such environmental preferences. The 1977 form of the EPQ contains seven scales developed on the basis of responses made by several hundred people, and as a result of various factor-analytic procedures. The seven EPQ scales are described next:

1. Nature: this scale deals with the preference for woodland areas, lakes, wilderness, and other natural settings;
2. Romantic escape: this scale also reflects a preference for natural settings, but adds a dislike for urban places;
3. Modern development: this one reflects a preference for modern housing developments and industrial areas;
4. Suburbs: here the focus is on ownership of property and law and order;
5. Social: this scale shows a preference for people activities such as parties and conversation;
6. Passive reaction to stress: here a preference for sleeping, eating, and pleasant smells like perfume is indicated;
7. City: this scale expresses a preference for the bustle and excitement of a large city.

The EPQ consists of six questions or areas of concern; for example, question 1 asks the respondent to indicate their preference for each of 11 settings, such as “the bustle and excitement of a large city” and “a woodland area.” Responses are made using a 5-point agree-disagree scale. A total of 60 items are thus rated. The seven EPQ scales range in length between 4 and 12 items. Internal consistency alpha coefficients range from .63 to .83, with five of the scales falling in the .70s.

In a study of high-school students, R. Kaplan (1977) found some significant but modest correlations between EPQ scales, self-esteem, and motivations for different recreation activities, supportive of the construct validity of the scale. It could be argued, however, that this scale is really a personality inventory. After all, saying that “I like a bright sunny day” vs. “I like collecting pine cones” could easily be items from an MMPI- or CPI-like instrument.

ASSESSMENT OF FAMILY FUNCTIONING

The majority of tests considered in earlier chapters have the individual as their focus. That is, whether we administer an intelligence test, a personality inventory, a checklist of values, a measure of creativity, etc., we are assessing an individual. Tests can also be used to assess the psychosocial and/or physical environment, such as a college campus as we discussed earlier, or, as we will
see next, the relationships between individuals as family members or the family as an organized system.

The earliest attempts at assessment of couples were primarily based on self-reports, and focused on such aspects as satisfaction and marital adjustment (e.g., E. W. Burgess & Cottrell, 1939). Beginning in the 1960s, as movie cameras and later video cameras became available and affordable, observational techniques through filmed interactions became more popular. This led to the development of coding systems to assess the filmed interactions, at first primarily the verbal interactions of parents and their children. Beginning in the 1980s, a number of self-report measures of family functioning began to be available, such as the Family Adaptability and Cohesion Evaluation Scales (Olson et al., 1985), considered by some to represent a benchmark in family assessment (Halverson, 1995).

Family measurement has been affected by many disciplines who bring to the field their own particular points of view and preferred measurement techniques. For example, sociologists depend more on survey and interview methods, whereas psychologists use self-report inventories, or those with a behaviorist bent use observational techniques. Currently, the emphasis seems to be on multiple measurement strategies.

In the area of family and marriage, reliability and validity of measurement was ignored until the 1960s, when several publications began to focus on this issue. As B. C. Miller, Rollins, and Thomas (1982) state so well, “What we know and how we know it are inextricably linked.” Still, reliability and validity are often ignored in the family literature even today.

In assessing family behavior through direct observation, for example, a key issue is at what level of behavior we observe and analyze. For example, we could observe behaviors such as hugging, expressing approval, and joking, or we could combine such behaviors into a behavioral construct of warmth. The limited evidence suggests that a broader level of analysis, the warmth level, is more reliable and valid than a more specific level of behavior; similarly, rating scales that basically summarize a variety of observational events are more reliable and valid than individual observations (A. L. Baldwin, Cole, & C. Baldwin, 1982). Not everyone agrees; D. C. Bell and L. Q. Bell (1989), for example, argue that “micro” coding of smaller units of behavior is always better than “macro” coding inferences made from the behavior. Much of family research is based on self-report, however (for a review of methods to study families, see B. C. Miller, Rollins, & Thomas, 1982).

General problems. There are a number of problems in the area of family assessment vis-à-vis measurement. Halverson (1995) cites seven:

1. Most research in this area is small-scale research that is never replicated.
2. There are too many measures measuring too many constructs. Some 80% of available measures have never been used more than once, and there is no consensus on what are the most important constructs in family assessment. Even the scales that are used widely have met with a lot of criticism (e.g., Conoley & Bryant, 1995; Norton, 1983; Sabatelli, 1988).
3. Most measures are of unknown reliability and validity.
4. There are few, if any, studies that compare the usefulness of various constructs within the same study, such as by using a multitrait-multimethod approach.
5. Most of the measures we have are self-reports, instruments that elicit information from individuals who report on their family’s functioning.
6. Family assessment measures have been developed on relatively small and restricted samples.
7. Most instruments lack normative data.

Family-Systems Rating Scales

Many family-systems rating scales have been developed, although a substantial number lack appropriate psychometric information as to their reliability and validity. Carlson and Grotevant (1987) reviewed eight rating scales used in the assessment of family functioning and concluded that these scales demonstrated both strengths and limitations, as well as differences in their reliability and validity. They concluded that such rating scales were the “method of choice” in family assessment, even though their research utility and usefulness were questionable.
Rating scales can be improved if what is to be rated is a single dimension clearly defined and anchored with specific behaviors. If the rating to be made requires a high level of inference, the results usually show lower reliability and validity. Rating scales can also be improved if the rater is well trained. Note that a rater is not simply an accurate observer; he or she takes the observations and dynamically integrates them into a rating.

Cohesion and control. Two major theoretical dimensions in the area of family assessment are cohesion and control. Given their importance, there have been a number of attempts to operationalize these by specific measures. Unfortunately, different measures of these constructs do not correlate highly with each other, even when they use the same method, such as self-report; they correlate even less across methods, such as self-report vs. behavioral measures (e.g., Oliveri & Reiss, 1984; Olson & Rabunsky, 1972; Schmid, Rosenthal, & Brown, 1988).

Marital quality. There are a number of marital-quality scales, a term that subsumes marital adjustment, satisfaction, and happiness. Some of these scales, such as the Locke-Wallace Marital Adjustment Test (Locke & Wallace, 1959) and the Dyadic Adjustment Scale (Spanier, 1976), are very well known and used in many studies, while many other scales have languished in relative obscurity. In general, most of these scales show substantial psychometric limitations. Even when the overall reliability and validity are acceptable, there is controversy about the factor structure (as in the case of the two scales mentioned above) or other aspects.

Impression management. A basic assumption in studying marital or cohabiting partners is that the information they provide on self-report scales is veridical. As we have seen in Chapter 16, a major concern in the area of personality has been social desirability, and as we have also seen, social desirability can be conceptually divided into self-deception (where the person really believes the distorted evidence he or she is presenting) vs. impression management, which is a positively biased response designed to enhance one’s self-image as presented to others.

Hunsley, Vito, Pinsent, et al. (1996) studied the relationship of impression management to various marital-relationship measures. They found that impression management biases did not affect women’s self-reports. It did affect men’s self-reports, but controlling for such bias statistically did not alter the results – that is, such biases, when present, are minor and do not affect the self-reports given, at least when it comes to “commonplace” aspects of marriage such as communication, as opposed to physical violence. However, other studies have found modest correlations (above .30) between measures of social desirability, such as the Marlowe-Crowne scale, and various measures of marital adjustment or satisfaction (e.g., Roach, Frazier, & Bowden, 1981).

The McMaster Family-Assessment Device (FAD)

The FAD is a 60-item self-report instrument developed on the basis of systems theory. Applying systems theory, the family is a system where the various parts of the family are interrelated, and the whole does not simply equal the sum of the parts; the focus is on relationships, on how the family operates as a system (Epstein, Baldwin, & Bishop, 1983). Thus, the FAD assesses family functioning, theorized to follow six dimensions: problem solving, communication, roles, affective involvement, affective responsiveness, and behavior control. There is also an overall General Functioning scale. Each dimension is rated on a 7-point scale with three defined anchor points: 1 = severely disturbed, 5 = nonclinical, and 7 = superior. Ratings of 1 through 4 indicate family dysfunction, and therefore the need for intervention. Kabacoff, Miller, Bishop, et al. (1990) present psychometric data on this scale using large psychiatric, normal, and medical samples. The internal reliability (Cronbach’s alpha) for the subscales ranges from .57 to .80, and from .83 to .86 for the overall General Functioning scale. Two lines of evidence are presented for the construct validity of this instrument. First, families with a psychiatric member reported significantly greater difficulties on all six subscales; no significant differences were obtained between medical families and normal families. Second, confirmatory factor analysis supported the hypothesized theoretical structure of six factors.
For examples of other recently developed marital scales, see Arellano and Markman (1995) and Baucom, Epstein, Rankin, et al. (1996).
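Because coefficient alpha figures like those just cited appear throughout this chapter, a short sketch of the computation may be helpful. The formula is alpha = (k / (k − 1)) × (1 − Σ item variances / variance of total scores), where k is the number of items; the response matrix below is invented for illustration and is not actual FAD data.

    # Minimal sketch of Cronbach's alpha for one subscale.
    # Rows are respondents, columns are items; the data are invented.
    def cronbach_alpha(rows):
        k = len(rows[0])  # number of items

        def var(xs):  # population variance
            m = sum(xs) / len(xs)
            return sum((x - m) ** 2 for x in xs) / len(xs)

        item_vars = [var([row[j] for row in rows]) for j in range(k)]
        total_var = var([sum(row) for row in rows])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    responses = [
        [4, 3, 4, 4],
        [2, 2, 3, 2],
        [3, 3, 3, 4],
        [1, 2, 1, 2],
    ]
    print(round(cronbach_alpha(responses), 2))  # 0.93 for these invented data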
Projective instruments. One approach to family assessment is to use projective instruments. You recall that the distinguishing feature of projective tests is the presentation of relatively unstructured stimuli that allow for an almost unlimited number of possible responses. How a person perceives and interprets these ambiguous stimuli presumably reflects fundamental aspects of the person’s psychological makeup and motivational dynamics.

One of the major projective techniques is the Thematic Apperception Test (see Chapter 15). Although the TAT was developed for use with individual clients, it has been used to study family functioning and dynamics (e.g., Winter, Ferreira, & Olson, 1966). It has also given rise to a number of special sets of cards that portray family scenes. One such set of 21 cards is the Family Apperception Test (Julian, Sotile, Henry, & Sotile, 1991). Seven of the cards show scenes of nuclear family members engaged in group interaction; seven cards show interactions between an adult and a child, and the other seven depict a variety of situations. The scoring guide covers seven basic concepts, such as conflict and quality of relationships, and these yield nine scoring categories. Reliability is reported primarily by percentage of agreement, with Cohen’s kappa coefficients ranging from a low of .28 to a high of .77. Several studies of discriminant validity show statistically significant differences between clinical and nonclinical subjects. This test is criticized, however, for lack of content validity – the nine scoring categories do not cover many of the topics central to family theory. Several of the scoring categories that are included are poorly conceptualized, and 14 of the 21 cards may elicit responses that have nothing to do with family themes. As the reviewer indicates, the validity of this instrument has not been demonstrated (Bagarozzi, 1991).

Semi-projective techniques. There are a number of techniques that are somewhat more difficult to categorize because they combine various aspects of projective and objective techniques. One category is figure placement techniques, which presumably reflect an individual’s or a family’s perception of how the family is organized. Most of these instruments have, however, lacked standardization and have, at best, modest reliability and validity (Bagarozzi, 1991). An example is the Family Systems Test (FAST), developed in Germany, originally in a German version, but also available in English. The FAST is based on a specific family theory that considers cohesion and hierarchy as two important dynamic aspects of family functioning. The FAST consists of a board divided into 81 squares, with 6 male and 6 female figures, 18 cylindrical blocks of 3 sizes, and 1 male and 1 female figure each in orange, violet, and green. The respondent, an individual, a couple, or an entire family, arranges the figures on the board to indicate cohesion, and elevates the figures with blocks to indicate hierarchy. This is done three times to represent a typical situation, an ideal situation, and a conflict situation. Cohesion is operationally translated as proximity of figure placement, and hierarchy as the differences in height between figures. Cohesion and hierarchy scores are combined to reflect family structure in each situation, or can be combined across situations to assess perceived family flexibility. There are two additional options that involve using figures of varying colors, and indicating degree of eye contact between figures.

In a study of California adolescents, test-retest reliability over a 1-week period was adequate (.63 to .87), with higher coefficients associated with older adolescents. Test-retest reliability over a 4-month interval was poor (.42 to .59). The construct validity of the FAST is mixed: some findings are supportive of hypothesized relationships and some are not. There are a number of concerns about this technique, which potentially may be useful.
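The FAST’s operational definitions translate directly into simple computations. The sketch below is an illustration of the idea only, not the FAST’s published scoring rules: it treats cohesion as a function of the mean distance between figure placements on the 9 × 9 board, and hierarchy as the spread in block heights. The placements themselves are invented.

    # Illustrative only: quantifying figure placements on a 9 x 9 board.
    # Positions, heights, and scoring conventions are hypothetical.
    from itertools import combinations
    from math import dist

    placements = {  # figure: ((column, row), block height)
        "mother": ((2, 3), 2),
        "father": ((3, 3), 3),
        "child": ((7, 8), 0),
    }
    positions = [pos for pos, _ in placements.values()]
    heights = [height for _, height in placements.values()]

    pairs = list(combinations(positions, 2))
    mean_distance = sum(dist(a, b) for a, b in pairs) / len(pairs)  # low = cohesive
    hierarchy = max(heights) - min(heights)  # spread in elevation

    print(round(mean_distance, 2), hierarchy)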
Assessment of couples. What are the purposes of couple assessment? Fruzzetti and Jacobson (1992) suggest that there are three purposes:

1. To obtain information to be used for clinical intervention. Typically, these are couples who are about to enter psychotherapy, and the assessment is often broad, aimed at understanding the couple within an overall context.
2. As part of a research assessment, perhaps a study of basic relational processes. Often, these
are more focused endeavors and are typically theory driven.
3. Both clinical and research needs are present – as, for example, in a study of couples in therapy, where there is a need to evaluate the changes that occur as a function of therapeutic interventions.

Fruzzetti and Jacobson also point out that there are many different factors that influence the functioning of a couple, and thus assessment may be aimed at one of several aspects. Among the major categories:

1. Individual aspects. There are many personal or individual aspects that each person brings to a relationship that can affect that relationship. These may involve psychiatric conditions, such as depression or substance abuse, or limited interpersonal skills, such as the inability to communicate clearly, to express emotions appropriately, to be sensitive to the other person’s needs, and so on. In this area of assessment, inventories such as the MMPI and CPI can be quite useful.
2. Contextual aspects. These refer to the physical and psychological environment or context that the couple finds themselves in – essentially the question of, “What is their life as a couple like?” Are there children present? Are there elderly infirm parents? Are there money concerns? What is their living environment like? Is there a support system? These contextual aspects are typically assessed through interviews rather than psychometric measures, although instruments do exist to assess specific aspects such as social support.
3. Conflicts and satisfactions. How partners think and feel about their relationship is of primary importance, and often therapy is sought out because of real or perceived dissatisfactions. A substantial number of questionnaires have been developed in this area, such as the Marital Satisfaction Inventory (D. K. Snyder, 1979). In general, these questionnaires seem to have adequate reliability and moderate validity.
4. Interactional processes. A couple is not simply a sum of two individuals; most theorists in this area assume that there is an interactive effect, that the couple is a system with properties that go beyond a simple summation of the parts. Among the more important interactive processes, Fruzzetti and Jacobson list violence, communication, and physiological arousal, as well as the structural dynamic aspects of such interactions, such as the degree of demand-withdrawal (i.e., as one partner makes greater demands, the other withdraws emotionally).

BROAD-BASED INSTRUMENTS

Most of the instruments we have covered in this text are fairly specific in their objective; some, for example, attempt to measure depression, intelligence, or neuroticism. There are, however, a number of techniques that by their very nature are quite flexible and can be used in a variety of endeavors. The Semantic Differential, for example, discussed in Chapter 6, is one such technique. We will briefly take a look at three other techniques as illustrative: sociometric procedures, adjective checklists, and Q sets.

Sociometric Procedures

These procedures, developed in the 1930s (e.g., Koch, 1933), are one of the most frequently used methods for measuring a child’s status within a peer group (i.e., the child’s popularity or degree of acceptance). These procedures essentially ask children to nominate children who have certain attributes – for example, who are the three most intelligent children in this classroom? The questions can take a variety of forms, the nominations may be positive or negative, and the responses may be open-ended or use a paired comparison (e.g., do you prefer Tommy or Billy to be on your basketball team?). Thus, there is no one procedure, but rather a broad array of procedures. In general, three types of procedures are more commonly used (Hops & Lewin, 1984):

1. Restricted nominations. Here children in a classroom are asked to choose a number of classmates for a specific situation (e.g., which three children do you like the most?). Positive nominations for each child can be easily tabulated as an index of friendship. Negative nominations can also be used (e.g., which three children do you like the least?) as an index of rejection. Such procedures usually ask the child to write the names down, but younger children can also be asked individually to select photographs of classmates. Various analyses can then be undertaken – for example, children who receive no positive nominations are “isolates,” whereas children who receive a lot of negative nominations are rejected children.
2. Peer rating scales, which involve a Likert-type rating scale. Typically, the children are provided with a roster of their classmates, and rate each child on a 5-point scale indicating, for example, how much they like that child. Sometimes the verbal anchors are accompanied by or replaced with various sequences of happy faces.
3. Paired-comparison procedure. Here each child is asked to choose between every possible pair of children (excluding oneself), again in terms of a particular situation. If we have a classroom of 25 children, then each child will make 276 such judgments (every possible pairing of the other 24 children); that is the major drawback, although short cuts are available (for illustrative studies see Asher, Singleton, Tinsley, et al., 1979; Deutsch, 1974; LaGreca, 1981).

Because there are so many variations in sociometric techniques, it is difficult to generalize about the reliability and validity of these techniques. At the risk of oversimplifying, however, we can make some general tentative conclusions. First, as to reliability:

1. Reliability generally seems to be adequate.
2. Reliability is greater with older children than with younger children.
3. Reliability of positive nominations is greater than that of negative nominations.
4. As usual, test-retest reliability decreases as the time interval gets longer.
5. Peer rating procedures seem somewhat more reliable than restricted nomination procedures.
6. Lack of reliability may not necessarily represent “error” as in the classical test theory model, but may reflect legitimate sources of variation. For example, sociometric ratings made by less popular children are less stable; they may simply reflect their inability to form stable friendships.

With regard to validity, the following applies:

1. In general, there is a moderate to low relationship between sociometric indices and various indices of behavior. Positive correlations are typically found between indices of social acceptance and prosocial behavior, and between indices of rejection and negative aggressive social behavior. For example, rejected children are twice as likely to engage in aggressive behavior on the playground.
2. Acceptance and rejection are not inversely related – i.e., they are basically independent dimensions rather than the opposite ends of a continuum.
3. Peer ratings are only moderately related to teachers’ ratings.
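Pulling these procedures together, the sketch below (with invented names and choices) tallies restricted nominations into acceptance and rejection indices, and shows how quickly the paired-comparison workload grows: each child must judge every possible pair of the other n − 1 classmates, that is, (n − 1)(n − 2)/2 pairs.

    # Minimal sketch of sociometric tabulation; names and choices are invented.
    from collections import Counter
    from math import comb

    liked = {"Ann": ["Ben", "Cara", "Dev"],
             "Ben": ["Cara", "Dev", "Eli"],
             "Cara": ["Ann", "Ben", "Dev"]}
    disliked = {"Ann": ["Eli"], "Ben": ["Eli"], "Cara": ["Eli"]}

    acceptance = Counter(name for picks in liked.values() for name in picks)
    rejection = Counter(name for picks in disliked.values() for name in picks)
    print(acceptance.most_common())  # index of friendship
    print(rejection.most_common())   # index of rejection

    # Paired comparisons: judgments per child for a class of n children
    for n in (10, 25, 30):
        print(n, comb(n - 1, 2))  # n = 25 gives the 276 judgments noted above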
Adjective Checklists

Natural language. Adjectives represent a natural descriptive language with relevance and utility for both lay persons and scientists. Several adjective checklists are available in the psychological testing literature, ranging from broad-based personality assessment techniques such as the Adjective Check List (ACL; Gough & Heilbrun, 1965), to lists developed for specific populations such as children and the mentally retarded (G. Domino, 1965; G. Domino, Goldschmid, & Kaplan, 1964; Lipsitt, 1958), or to measure specific variables such as affect and mood (Lorr, Daston, & Smith, 1967; Zuckerman & Lubin, 1965), depression (Lubin, 1965; 1967), or complexity-simplicity (Jackson & Minton, 1963).

Assessing environments. Craik (1971) reviewed the assessment of environments and argued that the development of standard techniques for the assessment of environmental settings was a prime requirement for the advancement of environmental psychological research. There are in fact a wide variety of techniques in this area, from adjectival descriptions (e.g., G. Domino, 1984; Kasmar, 1970), to measures of environmental participation (Mathis, 1968), to standardized questionnaires designed to assess social climates (Moos & Gerst, 1974), and to more complex rating techniques (Craik, 1971).

A number of investigators have taken standard psychological instruments and successfully applied them to the study of the environment. For example, Canter (1969) used the semantic differential to assess architectural plans and drawings. Driver and Knopf (1977) used the Personality Research Form in a study of outdoor recreation, while Schiff (1977) used Rotter’s Locus of Control Scale. Others have developed specific instruments, often modeled on personality inventories, but designed to measure
environmental dispositions; for example, Sonnenfeld (1969) developed the Environmental Personality Inventory, while McKechnie (1975) developed a Leisure Activities Blank.

The Environmental Check List (ECL)

The ECL was developed using the steps outlined in Chapter 2. The first step was to obtain, from 304 volunteers, open-ended written descriptions of various geographical localities where the respondent had lived for at least three years. These volunteers included college students, business executives, and community adults in 11 different settings in both the United States and Canada. The localities they described covered 42 states and 23 foreign cities, and included large cities, villages, and even a psychiatric hospital and a nudist camp!

From these descriptions the author (G. Domino, 1984) culled an initial list of 288 adjectives, by eliminating synonyms, esoteric words, foreign terms, or words with dual connotations (like “cool”). The initial list was then given to two samples of college students. In one sample, students checked the words they were familiar with and could define, and the second sample checked those words that were descriptive of any geographical setting they were familiar with. Items checked by at least 70% in both samples were retained, resulting in a 140-item list.

The ECL then consists of 140 adjectives in alphabetical order from attractive to young. Representative items are: bustling, crowded, easy going, expensive, green, large, safe, and windy. The instructions simply ask the subject to “check items that describe ___,” with the examiner specifying the target, such as “New York City,” “your hometown,” “the college library,” “your ideal retirement place,” and so on.

Reliability. The ECL was administered twice to a sample of college students over a 21-day interval, to describe the one place where they had lived most of their lives. The test-retest correlation of the individual items ranged from .18 to .98, with a mean of .67; for the entire ECL the test-retest correlation was .82.

In a second study, two samples who had filled out the ECL to describe Tucson were treated as a unit, and a score was obtained for each ECL item. For example, if 14 out of the 20 subjects checked the word “delightful,” then a score of 14 would be assigned to that word. Scores for the two samples were then correlated across all ECL items, with a resulting coefficient of .78; this might be considered an interrater reliability. A similar procedure, with two other samples who described a new library building, yielded a coefficient of .86.
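The scoring procedure just described is easily expressed in code. In this sketch the words and endorsement counts are invented: each item’s score within a sample is simply the number of respondents checking it, and the two samples’ item scores are then correlated across all items.

    # Minimal sketch of the ECL item-score procedure; the data are invented.
    words = ["bustling", "crowded", "delightful", "green", "windy"]
    sample1 = {"bustling": 12, "crowded": 15, "delightful": 14, "green": 6, "windy": 18}
    sample2 = {"bustling": 10, "crowded": 16, "delightful": 13, "green": 8, "windy": 17}

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    scores1 = [sample1[w] for w in words]
    scores2 = [sample2[w] for w in words]
    print(round(pearson(scores1, scores2), 2))  # an interrater-style coefficient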
Validity. Three validity studies are reported by the author (G. Domino, 1984). In the first study, a sample of 108 newcomers to Tucson filled out the ECL twice, first to describe Tucson and then to describe their prior place of residence. A year later they were recontacted and again filled out the ECL twice. Two questions were then asked of the data: (1) How did the perception of a new environment (Tucson) differ from the perception of a familiar environment (their prior residence)? and (2) What changes took place over the year’s time? The results indicated that newcomers had a highly positive reaction to Tucson, both initially and a year later. However, the year-later description was less positive and probably reflected the actual physical changes that had taken place in rapidly growing Tucson, with increased traffic, construction, and concerns about water quality and availability. There were also a number of changes in the perception of their prior residence, with negative words checked more frequently upon retest.

In the second study, five psychologists were asked to check ECL items that reflected a positive evaluation. Forty-three such items were identified by at least four of the five raters. These items were then used as a scale to score the protocols of the Tucson newcomers. It was hypothesized that those whose initial reaction to the desert environment of Tucson was extreme, either in a positive or negative manner, should experience subsequent difficulties, because such extreme reactions would be indicative of poor reality testing, and these individuals would be more likely to leave Tucson. Indeed, of 98 respondents, 26 had left within a 5-year period, and a statistical analysis of their initial ECL endorsements supported the hypothesized relationship.

A third study involved a sample of adults who completed the ECL to describe Tucson, and in addition indicated for each item checked whether that item was positive or negative. Thus, two
individuals may both check the word “hot” to describe Tucson. But one person may love the heat and see it as a positive feature, while the second person may hate that heat. In addition, subjects completed two scales from McKechnie’s (1975) Environmental Response Inventory: (1) the Urbanism scale, which measures enjoyment of city living, and (2) the Need for Privacy scale, which measures the preference for solitude and isolation. Finally, ECL items were identified to parallel the Urbanism (e.g., bustling and cosmopolitan) and Need for Privacy (e.g., quiet and rural) scales. The hypothesis was that subjects high on urbanism, as defined by McKechnie’s scale, would not necessarily perceive Tucson as more urban, but as more positively urban, while low-scoring subjects would see Tucson as less positively urban. A similar hypothesis applied to the need for privacy dimension. Both hypotheses were supported. For example, scores on McKechnie’s Urbanism scale correlated .22 with the number of urban items checked on the ECL, but correlated .58 with positive urban items checked.

The Adjective Check List (ACL)

The best known of all adjective check lists is the ACL (Gough & Heilbrun, 1983), which was initially developed in 1949 as a method of recording the observations and reactions of the staff of the Institute of Personality Assessment and Research of the University of California (Berkeley), as the staff studied subjects participating in various assessment programs. Thus, the ACL was an observer instrument; it quickly evolved, however, into a personality inventory as a self-report measure, which contains some 37 scales, some developed rationally and some empirically. Although this instrument could have been presented in Chapter 4 on personality, I include it here because of its broad applicability to various areas, including the assessment of environments.

Fifteen of the 37 scales assess needs, as reflected in H. A. Murray’s (1938) need-press theory of personality. These scales were developed by asking 19 graduate students in psychology to check which adjectives, if endorsed, would indicate a particular need, such as need Dominance. If at least 9 of the 19 judges agreed on a word, then that word was included in the scale. Once the scales were developed, various statistical analyses were carried out to minimize the overlap of items from scale to scale and to maximize the correlations between individual items and total score. Other scales assess such aspects as Counseling Readiness, Personal Adjustment, and Military Leadership.

The ACL Manual (Gough & Heilbrun, 1983) gives a considerable amount of data on normative samples, reliability, social desirability, factor analysis, case illustrations, and other issues. The ACL has been used in hundreds of studies, including many cross-cultural applications, and in the study of environments. In one study reported in the Manual, introductory psychology students were asked to complete the ACL to describe the cities of Rome and of Paris. Both males and females saw these cities in an unfavorable light, with females being somewhat more favorable toward Rome. Both cities were seen as vain, self-indulgent, and indifferent to the feelings and wishes of others. Rome was seen as more tenacious, while Paris was more flamboyant and more feminine.

Q Sorts and Q Sets

Introduction. In Chapter 8 and above we talked briefly about the Adjective Check List, a list of 300 words in alphabetical order from absent-minded to zany (Gough & Heilbrun, 1965). Imagine that we took a random subset of these words, say 100 of them, and printed each word on a 3 × 5 index card. We could now give the deck of cards to a subject and ask that person to sort the 100 items into 9 separate piles or categories to describe someone (the person doing the sorting, or their ideal self, or their spouse, or their ideal spouse, or the “typical” alcoholic, etc.), along a dimension of representativeness, with the most characteristic items in pile 9 and the least characteristic items in pile 1. The person has now done a Q sort, and the deck of cards is called a Q set (although you will find the term Q sort applied both to the set and to the sort). If this procedure has a deja vu feeling, it is because this is the same procedure that was used in the Thurstone method of equal-appearing intervals to create an attitude scale (see Chapter 6).

There is nothing sacred about the number 9, so we could instruct the subject to use 5, 7, 11, or 681 categories.
Ordinarily, we would use an odd number of categories (9 rather than 10) so that there is a midpoint, and we would use a number that does not make the task too onerous (681 would definitely be out) and does not collapse the data into too few categories (e.g., 3 categories would not be a wise decision). As with any other rating task, it is better to have each category labeled, rather than just the extreme ones.

We could allow our sorter complete freedom in placing any number of statements in any one pile, as is done in the Thurstone method, but this freedom is not a particularly good psychometric procedure. Therefore we would instruct the sorter as to how many statements are to be placed in each category. For a 100-item Q set, using 9 categories, we would use the following distribution: 5, 8, 12, 16, 18, 16, 12, 8, and 5. Note that the distribution is symmetrical around the center point; indeed, these numbers come from the normal or bell-shaped distribution.

You realize by now that a Q set is an ipsative measure, and therefore a total score and normative interpretation of that score are not appropriate. However, we do have a set of 100 items distributed along 9 categories, so that each item’s score is the number of its category. If I place the word “intelligent” as one of the 5 items in pile number 9 as most descriptive of me, then the word intelligent can be assigned a score of 9. Note that by convention we use higher numbers for most descriptive items and lower numbers for least descriptive items. How can Q sorts be used psychometrically? There are actually a number of ways, from an impressionistic point of view as well as psychometrically.

Impressionistic analyses. Suppose we ask John, a college student seeking assistance at the Student Counseling Center, to do a Q sort. We can then go over with him his placement of the items, obtain a lot of psychodynamic impressions (for example, he describes himself as nervous and peculiar, and places the items creative and innovative in pile number 1), and even discuss his choices as a therapeutic starting point.

Consider this other example. Professionals from different fields who work in school or clinical settings often come together in a case conference to discuss a particular client. In an elementary school, the school psychologist, counselor, special ed teacher, school nurse, principal, etc., might meet to discuss a new child with multiple handicaps. Often communication does not move forward as well as it ought to because each specialist has different information, a different perspective, different jargon, etc. If each were to do a Q sort on the child and the different Q sorts were compared, the results could provide a common language and perspective.

Psychometric analyses. We ask Dr. Smith to do a Q sort to describe Richard, who is a patient in therapy. Dr. Smith uses a standardized Q set, namely the California Q set, which is probably the best known Q set (Block, 1961). We can now correlate Dr. Smith’s Q sort with one provided in Block’s monograph, on the optimally adjusted personality, that reflects the modal sorting by nine clinical psychologists. Block (1961) provides a convenient formula to use, which is a derivative of the standard formula for the correlation coefficient:

    r = 1 − Σd² / (2N(SD²))

where Σd² = the sum of the squared differences between Q values of corresponding items;
      N = the number of items in the Q set;
      SD = the standard deviation of the Q set.

To calculate the Σd² we would compare the placement of items, as in the example below:

    item          Category by Dr. Smith    Category, optimally adjusted    d    d²
    warm          8                        9                               1    1
    dependable    6                        8                               2    4
    etc.                                                                        SUM = Σd²
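Block’s shortcut is easy to verify computationally: when two sorts follow the same forced distribution, it reproduces the Pearson correlation between them. The sketch below uses the 100-item, 9-pile distribution given above, whose mean and standard deviation are fixed by the distribution itself; the item placements are invented, and with only three of the 100 items shown, the resulting r is purely illustrative.

    # Sketch of Block's (1961) formula; item placements are invented.
    counts = [5, 8, 12, 16, 18, 16, 12, 8, 5]  # forced counts for piles 1..9
    N = sum(counts)  # 100 items
    mean = sum(pile * c for pile, c in enumerate(counts, start=1)) / N
    sd = (sum(c * (pile - mean) ** 2
              for pile, c in enumerate(counts, start=1)) / N) ** 0.5

    smith = {"warm": 8, "dependable": 6, "anxious": 3}  # pile per item
    block = {"warm": 9, "dependable": 8, "anxious": 2}  # comparison sort
    sum_d2 = sum((smith[w] - block[w]) ** 2 for w in smith)  # 1 + 4 + 1 = 6

    r = 1 - sum_d2 / (2 * N * sd ** 2)
    print(round(sd, 2), sum_d2, round(r, 3))

In an actual analysis, all 100 items would enter the sum, and r could range from +1 (identical sorts) down to −1 (mirror-image sorts).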
Nature of items. The items of a Q set need not necessarily be personality-type items. They could be, for example, attitudinal statements sorted on an approval-disapproval continuum, statements dealing with career and vocational aspects (e.g., Kunert, 1969), study habits (Creaser, 1960), or even artistic drawings sorted on aesthetic dimensions (Stephenson, 1953). In fact, Q sorts
have been applied to a rather wide number of domains, such as attitudes about abortion (P. D. Werner, 1993), defense mechanisms (Davidson & MacGregor, 1996), criminology (J. H. Carlson & Williams, 1993), and marital functioning (Wampler & Halverson, 1990).

Psychometric aspects. To obtain statistical stability and acceptable reliability, the number of items in a Q set should probably be substantial, maybe 80 to 100, but not much larger or it becomes a rather onerous task.

Most Q sets are what Kerlinger (1986) labels as unstructured; that is, they consist of a set of items, like personality statements, that have been put together without specific regard to a structure that might underlie such items. A structured Q set is one where the items have been selected to reflect a specific theoretical or empirical structure. For example, a Q set designed to assess differential diagnosis between two distinct personality types would contain an equal number of items pertinent to each type. As another example, Kerlinger (1986) reports a study in which a 90-item Q set was used, where each item reflected one of the six major values in Spranger’s theory (see Chapter 19). For example, words such as God and church represented the Religious value, and words such as science and reason represented the Theoretical value. Subjects like ministers, presumably high on the Religious value, and musicians, presumably high on the Aesthetic value, were then asked to sort the items according to the degree to which they favored or did not favor the words, with 15 items representing each of the 6 values.

Wide applicability. Q sets have found wide application not just in psychology but in other fields such as political science (e.g., S. R. Brown, 1980) and nursing (e.g., Dennis, 1986). Q sorts can be done not only by adults but by children as well. For example, V. D. Bennett (1964) developed two forms of a self-concept Q set, each made up of 26 statements, and administered these to sixth-grade students. Each child was asked to sort the items into five categories from “most like me” to “most unlike me,” with the Q cards placed in designated pockets rather than in piles. Q sets have also been used with the elderly in gerontological research (e.g., Hayslip, 1984).

An example. Q sorts not only can yield useful empirical data but, perhaps more important, can provide a useful methodology to test theoretical ideas. For example, Metzger (1979) attempted, through Q methodology, to assess a theory proposed by Kübler-Ross (1969), who hypothesized that a dying person goes through stages of denial, anger, bargaining, depression, and acceptance in coming to grips with impending death. Metzger developed a 36-item structured Q set from a larger pool of items – items such as “Why did God let this happen to me?” and “I pray to be well.” She administered the Q set to two couples, in which the wife had a potentially terminal medical diagnosis. Although the results did not support Kübler-Ross’s stage theory, the author concluded that the Q methodology was a useful procedure to investigate terminal illness.

SUMMARY

In this chapter, we looked at behavioral assessment, various basic issues, and illustrative instruments like the Pleasant Events Schedule and the Fear Survey Schedule. We also looked at how tests can be used in program evaluation, the assessment of environments, and the assessment of systems like the family. Finally, we looked at sociometric techniques, adjective checklists, and Q sets – all broad-based techniques that can be used in a variety of ways.

SUGGESTED READINGS

Calvert, J. D., Moore, D. W., & Jensen, B. J. (1987). Psychometric evaluation of the Dating Anxiety Survey: A self-report questionnaire for the assessment of dating anxiety in males and females. Journal of Psychopathology and Behavioral Assessment, 9, 341–350.

This article evaluates an instrument designed to assess dating anxiety. The authors report the results of two studies in which a factor analysis, an assessment of reliability (internal consistency), and concurrent validity were investigated.

Cautela, J. R. (1994). The use of the Anxiety Meter to reduce anxiety. Behavior Modification, 18, 307–319.

Describes the Anxiety Meter in treating anxiety reactions, especially in agoraphobia and panic reactions. A good illustration of behavioral assessment and therapeutic application.

Goldfried, M. R., & Kent, R. N. (1972). Traditional vs. behavioral personality assessment: A comparison of
methodological and theoretical assumptions. Psychological Bulletin, 77, 409–420.

A review of the assumptions that underlie traditional vs. behavioral assessment, as these assumptions impact the selection of test items and the interpretation of the responses to the test.

Greenspoon, J., & Gersten, C. D. (1967). A new look at psychological testing: Psychological testing from the point of view of a behaviorist. American Psychologist, 22, 848–853.

Perhaps more of historical interest, this paper proposes a “new look” at psychological testing, contrasting the classical psychodynamic view with what, in 1967, was a new point of view – that of the behaviorist.

Holm, J. E., & Holroyd, K. A. (1992). The Daily Hassles Scale (Revised): Does it measure stress or symptoms? Behavioral Assessment, 14, 465–482.

A study of the Daily Hassles Scale-Revised from a factor-analytic perspective.

DISCUSSION QUESTIONS

1. Compare and contrast classical with behavioral assessment.
2. What are some of the limitations of direct observation?
3. The chapter discusses penile tumescence measurement. Can you think of other techniques that are or might be used as behavioral assessments?
4. Many people have a phobia, or at the very least, an uncomfortable feeling, with hypodermic needles. How might you go about assessing such a fear?
5. Using the three Moos categories of classroom environments (interpersonal relationships, structural aspects, and goal orientation), how would you classify your classroom?
19 The History of Psychological Testing

AIM In this chapter we take a brief look at the history of psychological testing and
a peek at the future. For didactic purposes, we consider the current status of psycho-
logical testing as reflecting four major strands: the French clinical tradition, the Ger-
man nomothetic tradition, the British idiographic tradition, and the American applied
tradition.

INTRODUCTION

When you meet a new person at a party, for example, you want to know their name and a little bit about their background – what’s their family like, where did they grow up, and so on. In this text, you have “met” psychological testing; it is time to take a brief look backwards. Psychological testing is, in American society at least, quite ubiquitous. Children are tested in a variety of ways, from preschool days through middle-school graduation. High-school adolescents are given aptitude tests to determine what they can master, achievement tests to assess what they have mastered, minimum competency exams to determine whether their diploma should be granted, interest tests to help them choose careers, and college-entrance examinations to allow them to progress in the educational stream. Similarly, college students come face to face with a variety of educational and psychological exams, and if you think college graduation means the end of testing, you are in for a big surprise!

Psychological testing is a recent phenomenon, closely interwoven with 20th-century American culture, yet the use of systematic procedures for comparing and evaluating individuals is quite old. In China, for example, in the year 2200 B.C., public officials were examined every three years and were either promoted or dismissed on the basis of these examinations, which covered such areas as writing, music, archery, and ceremonial rites (P. H. DuBois, 1966). Bowman (1989) argues that such formal testing procedures may go back “only” 2,000 years rather than 4,000, but the conclusion is the same.

The Book of Judges, in the Old Testament, tells of what might be considered the first situational test. Gideon, a hero of the Israelites, is selecting an army to fight the Midianites. As a result of Gideon’s pep talk on the hazards of war, his 32,000 volunteers are reduced to 10,000, a number that for tactical reasons is, however, still too large. God then advises Gideon to take his prospective soldiers to the river for a drink; those that lap the water are to be accepted, but those that kneel to drink, and hence expose their backs to any potential enemy, are to be rejected. Only 300 passed this test and went on to victory over the Midianites – clear evidence of validity! The same Biblical book also provides another example of testing – a one-item verbal test that involved pronouncing the word “shibboleth.” Ephraimites pronounced the initial sh as “s” and were thus identified and executed on the basis of that linguistic giveaway (Wainer, 1987). The ancient Greeks also made use of tests, both individual

and group, primarily to assess physical achievement (K. O. Doyle, 1974).

African tribal life also contains numerous examples of ritualized procedures aimed at evaluation, which might in a loose sense be considered tests. For example, in Kenya, witch doctors often place a live toad in the patient’s mouth to determine whether the patient is really sick or merely malingering. If the patient is faking, the toad presumably will jump down his throat.

Today, psychological tests have become an intrinsic part of our lives, despite the intensive criticisms they have received. From aviation cadets to zoologists, there are relatively few individuals who have not, at one time or another, been tested. Despite this pervasiveness of psychological tests, their direct historical antecedents are rather short. We take a brief look at some of the historical figures and ideas that have shaped the present status of psychological tests, although we have already met a number of these. Psychology does not exist in a vacuum but represents the work of specific people living in a particular culture at a particular point in time. Although the individual names are important, what is more important are the societal movements that both impinge on the activities of individuals and are, in turn, changed by these activities.

As stated in the Introduction, four distinct but related movements are of primary concern to a historical understanding of today’s psychological testing: (1) the French clinical tradition, (2) the German nomothetic approach, (3) the British idiographic approach, and (4) the American applied orientation.

THE FRENCH CLINICAL TRADITION

Every student of abnormal psychology knows of Philippe Pinel (1745–1826), the French physician who, when placed in charge of the Bicetre Hospital for the Insane in 1793, freed the patients from their chains. Pinel’s action represents a turning point from a demonological explanation of mental illness (the insane are possessed by the Devil) to a medical and humanitarian one (the insane are ill); more basically, his action reflected the possibility of explaining behavior by natural rather than supernatural causes, a possibility that in the area of psychopathology was explored in detail by many French physicians.

Jean Esquirol (1772–1840) was one of these physicians. A pupil of Pinel, he is perhaps best remembered for a text on Des Maladies Mentales, which for several decades remained a fundamental text of psychopathology. Esquirol was one of the first to sketch the main forms of insanity and to apply elementary statistical methods to his clinical descriptions, mainly in the form of tabulations of causative categories. He was one of the first to explicitly differentiate between the mentally ill and the mentally deficient, and to propose that various degrees of mental deficiency could be distinguished on the basis of the patient’s use of language. (See A. S. Kaufman, 2000.)

Another famous French physician was Edouard Seguin (1812–1880), a pioneer in the field of mental deficiency. Seguin emphasized the importance of training the senses in the education of the mentally deficient and developed many procedures to enhance their muscular control and sensory abilities. Some of these procedures were later incorporated into tests of intelligence.

Hysteria and hypnosis. Ambroise-Auguste Liebeault (1823–1904), Hippolyte Bernheim (1840–1919), Jean Charcot (1825–1893), and Pierre Janet (1859–1947) are well-known protagonists in the history of hysteria (physical symptoms, such as blindness, without the underlying organic damage, due to psychological factors) and hypnosis, and they too reflect the French clinical tradition. Liebeault and Bernheim, physicians in the city of Nancy, proposed that hypnotism and hysteria were related, and both due to suggestion. Charcot, a leading neurologist and superintendent of the Salpetriere Hospital in Paris, disagreed and argued that organic factors were also operative in hysteria. It was to Charcot’s clinic that a young Viennese physician by the name of Sigmund Freud (1856–1939) came to study, and it was the work of the French physicians on the use of hypnosis on hysterical patients that led Freud to postulate such concepts as the unconscious. (See Libbrecht & Quackelbeen, 1995, on the influence of Charcot on Freud.)

Janet also came under the influence of Charcot. He is perhaps best known for establishing the dissociation school of psychopathology, a theoretical view of the mind as a system of forces in
equilibrium, whose dissociation or splitting off would be reflected in pathological behavior. Janet was a keen observer whose rich clinical experiences were transformed into vivid descriptions; his description of hysteria, for example, is considered a classic.

Although sweeping generalizations are of their very nature incorrect, it can be said that the efforts of French investigators were in general heavily grounded on clinical observations. This resulted in perceptive and detailed descriptions of various types of mental aberrations, but little concern over quantification of these clinical insights. Psychological testing must begin with sensitive observations of human behavior, and the French clinical tradition provided this; but additional steps must be taken.

Alfred Binet. Additional steps were taken by Alfred Binet (1857–1911), a French psychologist who occupies a central position in the history of testing. Although Binet is best known for the scale of intelligence he devised in 1905, he made many other contributions, including the establishment of the first French psychological laboratory in 1889, and the first French psychological journal, called L’Annee psychologique, in 1895.

Binet held that differences among individuals were due to differences in mental functioning, particularly in faculties such as reasoning, imagination, attention, persistence, and judgment. He began a thorough investigation of what techniques might be used to measure these faculties. Although German and British investigators had already developed a variety of procedures, Binet criticized these as emphasizing sensory aspects and reflecting simple, rather than complex, intellectual functions.

Binet’s work culminated in the 1905 Binet-Simon scale, the first successful test of general intelligence, whose aim was to identify those school children who could not profit from regular school instruction. Although earlier tests than this scale existed, they were not standardized, so Binet’s work is recognized as first (Bondy, 1974). As discussed in Chapter 5, the 1905 scale contained some 30 questions, ranging from psychophysical tasks such as comparing two weights to making rhymes and following a moving object with one’s eyes. The 1905 Binet-Simon scale was an empirical scale; the thirty tests that made up the scale were chosen because they were related to age, school attainment, and teachers’ judgments of the intelligence of the pupils, not because they reflected the theoretical preconceptions of the investigator. The Binet-Simon scale clearly demonstrated the greater accuracy of objective measurement over clinical and personal intuition, an issue that is basic to psychological testing.

As you know, in 1908 Binet and Simon published a revision of their scale, and in 1911 Binet published another revision involving a number of technical refinements (a detailed description of these scales can be found in Peterson, 1926). Although Binet had devoted much effort to the investigation of separate mental faculties, his scales reflected a global approach to intelligence, with no attempt to measure the relative contribution of each faculty to the total functioning of a child (T. H. Wolf, 1969a, 1969b).

In summary, modern psychological tests owe to the French tradition an emphasis on pathology, a clinical descriptive approach and, mainly because of Binet’s work, a practical, empirical orientation (for a review of intelligence testing in France after Binet, see W. H. Schneider, 1992).

THE GERMAN NOMOTHETIC APPROACH

Germany gave us, in 1692, the earliest instance of the use of rating scales in personality evaluation and the first documented use of numerical scales to represent psychological variables (McReynolds & Ludwig, 1984; 1987), but our story begins with a later and much more central protagonist.

As every psychology student knows, the science of psychology was born in 1879 at the University of Leipzig, Germany, with Wilhelm Wundt (1832–1920) as both puerpera and midwife (apparently a strong case can be made for 1875 rather than 1879; see Boring, 1965; R. I. Watson, 1966). The work and contributions of Wundt and his students comprise a basic chapter in the history of experimental psychology (see Boring, 1957, for a review, and Bringmann, Balance, & Evans, 1975, for a brief biographical sketch). There are at least four major reasons why their work is of importance to the development of psychological testing:
1. Their experimentation marks the turning point from a philosophical, armchair, speculative approach about human nature to an empirical approach based on quantification. French investigators were already looking at abnormal behavior from a medical-scientific viewpoint, yet they were relatively unconcerned with systematic experimentation, or with investigating normal behavior.

2. There was a heavy emphasis on the measurement of sensory functions, such as differential sensitivity to various modes of stimulation. This was later reflected in the early mental tests that included measures of weight discrimination, pain sensitivity, reaction time, and other functions. These were the tests that Binet criticized as too sensory.

3. The use of complex brass instruments in their psychological experiments and their emulation of the physiologists and physicists underscored the need for standardization of procedures and experimental conditions. Modern psychological tests are characterized by such standardization, where all subjects are exposed to essentially the same task, with the same instructions and scoring standards.

4. Wundt was a university professor, and testing, as well as most of psychology, grew up as an academic discipline. The role of the psychologist became primarily that of a scholar-researcher, and the practitioner aspects were added very slowly.

A few German psychologists, most of whom were pupils of Wundt, did assemble test batteries designed to measure complex functions. Emil Kraepelin (1856–1926), for example, investigated the psychological effects of fatigue in psychiatric patients and made extensive use of the free-association method, a task requiring the subject to respond to a stimulus word with the first word that comes to mind. Kraepelin also published a diagnostic classification regarded as the precursor of our current Diagnostic and Statistical Manual of Mental Disorders (Zilboorg, 1941). Hermann Ebbinghaus (1850–1909), well known for his monumental investigations of memory, administered tests of arithmetic, memory span, and sentence completion to groups of school children. In 1891, Hugo Munsterberg (1863–1916) described various tests he had given to school children, such as naming three different smells, reading aloud as fast as possible, and stating as quickly as possible the colors of ten objects. Munsterberg was later brought to Harvard University by William James (1842–1910) and became a dedicated practitioner of the applications of psychology to law and industry.

Germany has also contributed a longer tradition of viewing a problem in both its philosophical ramifications and its taxonomic potential (Misiak & Sexton, 1966). A well-known example of this is the work of Eduard Spranger (1882–1963). Spranger postulated that there are six main types of values, and that individuals tend to center and organize their lives around one or more of these values. The six values are: (1) the theoretical, aimed at the discovery of truth; (2) the economic, characterized by utility; (3) the aesthetic, centering on form and harmony; (4) the social, focused on love of people; (5) the political, interested primarily in power; and (6) the religious, directed toward understanding the totality of being. The Allport-Vernon-Lindzey Study of Values, originally published in the United States in 1931, represents an attempt to measure the relative prominence of these six basic orientations and was for many years a test used widely in social psychology.

The approach of the German psychologists was a nomothetic one, aimed at discovering the fundamental laws that govern the human mind. Investigative efforts were directed toward determining the fundamental workings of visual perception, auditory sensation, reaction time, and other similar problems. Differences between individuals in their reaction time or in their ability to discriminate sounds were regarded as a nuisance. In fact, much experimental control was applied to the elimination of these individual differences.

THE BRITISH IDIOGRAPHIC APPROACH

The British, on the other hand, were vitally interested in these individual differences and viewed them not as error, but as a fundamental reflection of evolution and natural selection, ideas that had been given a strong impetus by the work of Charles Darwin (1809–1882). It was Darwin's cousin, Sir Francis Galton (1822–1911), who united the French clinical tradition, the
German experimental spirit, and the English concern for variation, when he launched the testing movement on its course (A. R. Buss, 1976).

Galton studied eminent British men and became convinced that intellectual genius was fixed by inheritance. He then developed a number of tests to measure this inherited capacity and to demonstrate individual differences. He established a small anthropometric laboratory in a London museum where, for a small fee, visitors could undergo tests of reaction time, hearing acuity, color vision, muscular strength, and other basic sensorimotor functions. To summarize the information he collected on approximately 10,000 persons, Galton made use of existing statistical procedures or developed his own. He also realized that a person's score could be expressed in a relative rather than an absolute manner (e.g., John is taller than these 18 men, rather than John is 6 feet tall). This is an extremely important and fundamental concept in psychological measurement, yet one that can easily be neglected.

Galton's ultimate objective was the early identification of geniuses so that they could be encouraged to reproduce and thus improve the intellectual level of the human race. This grandiose idea, however, was not supported by the test results, for although sensorimotor functions demonstrated individual differences, they showed little relationship to other criteria of intelligence.
Factor analysis. Although it was Galton who laid the groundwork for the application of statistical methods to psychological and educational measurement, many others made important contributions as well. One of these was Charles Spearman (1863–1945), best known for his two-factor theory of intelligence, which stated that all cognitive activities are a reflection of general intelligence, labeled as g, and of specific abilities or factors. Because specific factors are specific to a particular test or activity, testing efforts should concentrate on measuring general intelligence. Spearman's statistical support for his theory, the intercorrelations between tests, represents a first step in the application of factor analysis. Spearman also contributed the concept of test reliability. He considered a test score to be the sum of truth and error, truth being the individual's actual standing on the variable being measured, and error being the work of a myriad influences that resulted in an increased or decreased test score (see Spearman, 1930, for an autobiography). This concept, aside from its basic importance in testing, did much to revive interest in tests, an interest that had been dampened by the lack of relation between test scores and school grades.
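In the notation of what later came to be called classical test theory (a modern restatement, not Spearman's own symbols), the idea can be sketched as follows: an observed score X is the sum of a true score T and an error component E, and reliability is the share of observed-score variance that is true-score variance:

\[
X = T + E, \qquad r_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
\]

On this view, a test whose true scores account for 90 of every 100 units of observed-score variance has a reliability of .90.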
Cyril L. Burt (1883–1971), an outstanding contributor to educational psychology, wrote books on testing and on factor analysis. He also published an English revision of the Binet scale in 1921, designed for children aged 3 and older (see A. D. Lovie & P. Lovie, 1993, on how Burt was influenced by Spearman). After his death, Burt became rather controversial as he may have committed fraud in his famous studies of the intelligence of separated twins (Joynson, 1989; Osborne, 1994; W. H. Tucker, 1994; 1997). Godfrey H. Thomson (1881–1955) developed a number of tests for the measurement of school achievement; he also supervised testing programs, conducted large population surveys, and played a major role in British educational psychology. Karl Pearson (1857–1936), a statistician and associate of Galton, made a large number of contributions; his name is perhaps best associated with the correlation coefficient. Pearson was one of the first to apply correlational analysis to the study of parent-child resemblance; for example, he measured the height and arm span of parents and of their children and reported a correlation of approximately .50, a result that Pearson interpreted as reflecting the contribution of heredity.
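For reference, the coefficient that bears Pearson's name can be written, in modern notation, as the covariance of two variables scaled by their standard deviations:

\[
r = \frac{\sum_{i}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i}(X_i - \bar{X})^2}\,\sqrt{\sum_{i}(Y_i - \bar{Y})^2}}
\]

A value of .50, as in Pearson's parent-child data, thus indicates a moderate positive association; squared, it suggests that roughly 25% of the variance in one measure is shared with the other.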
Ronald A. Fisher (1890–1962) was another outstanding contributor to the development of statistical methods; he is best known for developing techniques for the analysis of variance and small-sample statistics. Fisher was a statistician and geneticist whose contributions made statistics a tool at the disposal of psychology (Porter, 1986).

It is interesting to note that the British approach to the organization of abilities still emphasizes Spearman's g factor, while American psychologists consider g as of secondary importance and prefer to talk about group factors – factors common to many activities, but not all. Mention might also be made that early in the 19th century, the British established a civil-service examination system, which was
heavily influenced by the British missionaries and foreign-service personnel who had lived in China and had observed the extensive Chinese examination system.

The Scottish surveys. At the time of Galton, the British considered themselves the pinnacle of evolution. They were concerned, however, that the nobility, which obviously represented high levels of intelligence, had smaller families than the lower classes, who just as obviously had lower intellectual capabilities. Because a negative correlation had been found between intelligence and family size, with children from larger families tending to have lower IQs than those from smaller families, there was concern that, in several generations, the lower classes would gain the upper hand at least numerically, and England would thus become a nation of "morons." Several large-scale intelligence testing programs were undertaken. Mention might be made of the so-called Scottish surveys, carried out in Scotland under the direction of the Scottish Council for Research in Education. The first survey in 1932 was an attempt to test all 11-year-old children. Approximately 87,500 children, or 90% of all 11-year-olds, were administered a group test of 76 verbal and 9 pictorial items. A subgroup of 1,000 children was also given the 1916 Stanford-Binet. In 1947, the same group test was again administered to all 11-year-olds, this time to approximately 71,000. One of the interesting results of this survey was to indicate a small, but statistically significant, increase in mean test score from the 1932 survey. This finding was contradictory to the postulated decline.

THE AMERICAN APPLIED ORIENTATION

Despite Wundt's view of individual differences as error, one of his assistants was vitally interested in variation and, in fact, wrote his doctoral dissertation on individual differences in reaction time. This assistant was James McKeen Cattell (1860–1944), an American who is credited with being the first to use the term "mental test" in an 1890 paper in which he presented in some detail a series of 10 tests to measure a person's intellectual level. These 10 tests involved such procedures as the estimation of a 10-second interval, the reaction time for a sound, and the measurement of dynamometric pressure (the pressure exerted by one's grip). Cattell felt that differences in sensory acuity, reaction time, and similar functions would result in differences in intellectual achievement; hence he administered his tests to Columbia University college freshmen in the hope of predicting their college achievement (Sokal, 1987). Incidentally, Cattell and his wife had seven children, one of whom, named "Psyche," also became a well-known psychologist (Sokal, 1991).

Cattell's 1890 paper resulted in a tremendous interest in mental testing. Psychologists at various institutions began developing similar tests, and in 1895 the American Psychological Association formed a committee to investigate the possibility of having various psychological laboratories cooperate in the collection of data.

Other American investigators. Cattell was neither the first nor the only American investigator concerned with testing. Joseph Jastrow (1863–1944), a student of G. Stanley Hall and holder of the first PhD in psychology awarded in the United States, had developed a set of 15 tests that he demonstrated to visitors at the 1893 Columbian Exposition held in Chicago. These tests, with a heavy Wundtian flavor, included weight discrimination, reproduction of letters after tachistoscopic presentation, tests of color blindness, and tests of reaction time (see Jastrow, 1930, for an autobiography).

Lightner Witmer (1867–1956), who also studied with Wundt, established the first psychological clinic in the United States, in 1896 at the University of Pennsylvania (McReynolds, 1987; O'Donnell, 1979). As the number of clients referred to the clinic grew, Witmer began collecting tests and using these in diagnostic evaluations (for a survey of the clinic's activities from 1896 to its closing in 1961, see Levine & Wishner, 1977). Soon other university clinics were established, and diagnostic evaluation based on psychological test results became part of the clinical approach. Until the late 1920s, a large portion of the clinic psychologist's activities consisted of administering intelligence tests. Beginning in the 1930s, however, the psychological horizons expanded, and the use of projective tests to study
the deeper aspects of personality became a more important function. Most of these clinics served children rather than adults, with the result that psychologists often became identified as child experts.

Interest in testing was also reflected by the increasing number of reports related to testing that appeared in psychological and educational journals. Three reports can be considered representative. In 1892, T. Bolton reported a comparison of children's memory span for digits with teachers' estimates of their general mental ability. J. A. Gilbert in 1894 administered several tests (including reaction time, memory, and sensory discrimination) to about 1,200 Connecticut children and compared the results with teachers' estimates of general ability. Both studies indicated a correspondence between some tests and teachers' estimates, although the correspondence was low and of little practical significance. W. C. Bagley in 1901 compared the mental abilities of school children as evidenced by school marks and test scores and found an inverse relationship between these two types of abilities.

In 1903 Helen Thompson published the results of a study of sex differences. In this study, she used sensory acuity tests, associative reaction time, a general information test, tests of ingenuity (problem solving), and various other measures. She reported a number of sex differences, with men scoring higher on the ingenuity tests and women scoring higher on tests of memory.

In the area of educational testing, mention might be made of Horace Mann (1796–1859), the famous educator, who took what might be the first major step toward standardized testing when, in 1845, he replaced the oral interrogation of Boston school children with a written examination that presented to all children a uniform set of questions. Unfortunately, this work had negligible impact on educational practices. Joseph M. Rice (1857–1934) made, in 1895, lists of spelling words as well as arithmetic and language tests, and administered these to thousands of children in an effort to objectively evaluate the relative effectiveness of teachers. This work also did not receive the attention it deserved.

One of the most outstanding contributors to psychological and educational measurement was Edward Lee Thorndike (1874–1949) of Columbia University (Joncich, 1966). In 1904, he published the first textbook on educational measurement, An Introduction to the Theory of Mental and Social Measurements. This book presented many statistical concepts based on the work of Galton and Pearson, in relatively easy-to-understand language. Thorndike also developed a number of psychological tests and wrote voluminously on measurement. His work made Columbia University a center for the investigation of mental testing, even though his concern was not specifically aimed toward the measurement of general intelligence (see the article written by his grandson, R. M. Thorndike, 1990).

In the 1860s, several United States legislators introduced bills to establish a civil-service examining commission, modeled on the British one. By the late 1800s, several examinations had been developed to assess postal clerks, trademark examiners, and other occupations.

Back to Cattell. The great enthusiasm about testing generated by Cattell's work was, however, of short duration and was soon dampened by the lack of interest on the part of several well-known psychologists, such as William James, and more importantly, by the results of two research papers, one published by Stella Sharp and the other by Clark Wissler (F. S. Freeman, 1984). S. E. Sharp (1898–1899) administered a battery of mental tests to seven students. These tests, which were administered repeatedly, were more complex than those of Cattell and involved tests of imagination, memory, and other "higher" functions. She reported that, in general, there was little self-consistency in the test results, i.e., low reliability.

Wissler (1901) compared Cattell's psychological tests with anthropometric measures (e.g., height, weight, arm span) and with college grades. He reported that Cattell's tests did not correlate highly with either the anthropometric measures or the college grades. In fact, they did not correlate with one another! For example, the correlation between strength of hand and class standing was –.08, while reaction time and class standing correlated –.02. Grades in various courses, however, correlated substantially with each other – for example, Latin and Math correlated +.58.
Although both studies had serious methodological limitations, and they were not meant by their authors to be indictments of testing, they were nevertheless taken by many as proof that mental measurement could not be a reality. As a result, when the 1905 Binet-Simon scale appeared, psychologists in American universities exhibited little interest; however, psychologists and educators working in clinics and schools, faced with the same practical problems of mental classification that Binet was struggling with, were quite receptive and rapidly enthusiastic.

Mental deficiency. One such man was Henry Goddard (1866–1957), Director of Research at the Vineland Training School, a New Jersey institution for the mentally defective. At Vineland, Goddard founded the first laboratory for the psychological study of the retarded and was also a pioneer in using a team approach to research, involving house mothers, psychometricians, and others. Goddard translated the Binet scale into English and became an outspoken advocate of Binet's approach. In working with mental defectives, Goddard had become acutely aware of the fact that their potential for learning was extremely limited and apparently physiologically determined. It was an easy, though incorrect, step to assume that intelligence was therefore unitary and largely determined by heredity. Hence everyone could, and should, be measured with the Binet scale and assigned a place in society commensurate with their mental level. This was Goddard's gospel, despite the fact that Binet himself was opposed to the view of intelligence as a genetically fixed quantity. Goddard's views were not unique. It was widely believed at that time that the "feeble minded" were responsible for most social problems, that they reproduced at an alarming rate, and that it was important to restrict immigration, especially from Southern and Eastern Europe (Gelb, 1986).

Goddard is also remembered for his study of the Kallikak family, an account of two lines of descendants of Martin Kallikak (a pseudonym), an American soldier in the Revolutionary War who impregnated a feebleminded bar maid and then, after the war, married a "good" girl. The differential incidence of criminals, prostitutes, tubercular victims, and mental retardates in these two family branches was taken by Goddard and others as clear evidence of the hereditary nature of these conditions (Fancher, 1987; J. D. Smith, 1985).

Lewis M. Terman (1877–1956). Although Goddard's translation of the Binet-Simon scale and his passionate proselytizing statements made the Binet-Simon scale a popular and well-known test, it was the work of Lewis M. Terman, for 20 years (1922–1942) head of the Psychology department at Stanford University, that made the Binet test not only the best-known intelligence test, but a yardstick against which new tests were subsequently compared (see Terman, 1932, for an autobiography).

Terman's revision of the Binet test, which included a restandardization on an American sample of more than 2,000 subjects, was not merely a translation, but virtually resulted in a new test, the 1916 Stanford-Binet. As we saw, one of the major innovations of this test was the use of the IQ, the ratio of mental age to chronological age, a concept that had been proposed by William Stern (1871–1938), a German psychologist (Stern, 1930). Unfortunately, the subsequent popularity of the IQ concept made it synonymous with intelligence, and the IQ was soon considered a property of the person rather than of the test. In addition, the concept of IQ reinforced the notion that intelligence was relatively constant and genetically fixed.
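Written out, the ratio IQ of the 1916 scale is simply

\[
IQ = \frac{MA}{CA} \times 100,
\]

where MA is mental age and CA is chronological age (the multiplication by 100 removes the decimal). A child of chronological age 8 who performs like a typical 10-year-old thus earns an IQ of (10/8) × 100 = 125, while the same mental age at chronological age 12 yields an IQ of about 83.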
The Stanford-Binet represented a practical screening method and was enthusiastically received because it met a pressing need. Perhaps more importantly, the Stanford-Binet was proof that mental measurement was possible, and thus it led both to the development of other psychological tests and to public acceptance of testing. Terman is also well known for his longitudinal series of studies of California gifted children, known as the Genetic Studies of Genius (its participants affectionately dubbed "termites"), the development of a masculinity-femininity test, and a study of the psychological factors involved in marital happiness. His work was heavily grounded in empiricism, and his careful approach in many ways became a model to be emulated (Hilgard, 1957; Sokal, 1987).

Other adaptations of the Binet. In addition to the work of Goddard and Terman, there appeared
several revisions or adaptations of the Binet scales. Notable among these were the Kuhlmann-Binet, one of the first preschool tests of intelligence, which made it possible to test children as young as 3 months, and the Hayes-Binet, adapted for use with the blind. Mention might also be made of the Yerkes Point Scale, a 20-item test (19 of which came from the Binet) that was scored in terms of points rather than in terms of mental age credit.

G. Stanley Hall (1844–1924). By 1910, testing had grown to the extent that G. M. Whipple could publish a Manual of Mental and Physical Tests, discussing 54 available tests. The growth of psychological testing was, in great part, a reflection of the general growth in American psychology. For example, by 1894 there were 24 psychological research laboratories at such universities as Columbia, Chicago, Johns Hopkins, Stanford, and Yale. A major force in this growth was Granville Stanley Hall, a central figure in the establishment of child psychology. Hall's endeavors, reflected in his writings and organizational undertakings, were voluminous. He established several psychological journals and wrote extensively on a variety of topics including adolescence, religious experience, and senescence. He organized and was first president of the American Psychological Association, and in 1909 he invited Freud and Jung to the United States to help celebrate the 20th anniversary of Clark University; thus, he introduced the psychoanalytic movement to American psychologists. For the history of testing, however, his major contributions were the development of an extensive series of questionnaires covering topics such as fears, dreams, and foods for the study of children's thinking, and his great teaching influence – both Goddard and Terman, for example, were pupils of Hall (Averill, 1990; R. B. Evans & J. B. Cohen, 1987; E. C. Sanford, 1987; Sokal, 1990; Wapner, 1990).

The Healy-Fernald Tests. In 1911, William Healy and Grace Fernald published a series of tests to be used in the evaluation of delinquent children. Although the 1908 Binet-Simon scale was available, Healy and Fernald were interested not simply in obtaining a picture of the child's general mental level as reflected by a total score, but also in an awareness of the child's specific strengths and weaknesses. Thus, the child's performance on the various single tests was kept separate. The emphasis was not on what score the child obtained, but on how the child approached the various tasks and what the child's general behavior was. It was the process rather than the outcome that was of primary concern to Healy and Fernald. In keeping with this aim, the scoring and administration of the tests were not spelled out precisely. The series of tests was quite varied and included puzzles of various kinds; a "testimony" test in which the child was shown a picture of a butcher shop, the picture was then removed, and the child was asked to recall various details, including an attempt to measure the degree to which the child yielded to suggestion; a game of checkers; arithmetic problems; and drawing of designs from memory. It is interesting to note that while the Binet test underwent several revisions and is still one of the most commonly used tests of intelligence, the Healy-Fernald series is now only of historical interest. Perhaps this was due to its subjectivity and lack of standardization, which in many ways ran counter to the spirit of the time, so engrossed in quantifying mental processes.

In 1917, the Pintner-Paterson Performance Scale was published. This test is of interest because it represents the first major attempt to standardize a test that required no language on the part of either the examiner or the subject. In the same year, Helen Boardman published a monograph titled Psychological Tests, a Bibliography; like Whipple's efforts of 1910, this was an attempt to survey the field of psychological tests. Boardman divided her references into two areas, those that referred to the Binet-Simon and those that referred to other tests, mostly intelligence tests.

World War I and testing needs. The entry of the United States into World War I in 1917 created a pressing need for mental tests as selection and placement devices; there was need not only to screen out men whose intellectual capabilities were too limited for military service, but also to determine which men could be given specialized training or admitted to officer programs. The answer to these needs came in the form of two intelligence tests, the Army Alpha and the Army Beta, the latter designed for illiterates
or non-English-speaking recruits. The Army Alpha consisted of eight tests, including a directions test that required the examinee to follow oral directions given by the examiner, a multiple-choice test of common sense where the examinee indicated, for example, why we use stoves, and what to do if given too much change by the grocer, and an information test that asked the subject to identify the location of certain cities, the manufacturer of shoes, and other cultural items. These tests held an advantage over the Stanford-Binet in that they could be administered to many subjects at one sitting; almost two million men were tested (Von Mayrhauser, 1989).
Another World War I instrument developed in response to military needs was the Woodworth Personal Data Sheet (Woodworth, 1920), a self-report inventory of 116 yes-no questions concerning neurotic symptoms, designed as a rough screening device for identifying recruits with serious psychological difficulties, who would be emotionally unsuited for military service. Sample questions were: Do you feel tired most of the time? Did you have a happy childhood? Is it easy to make you laugh? Do you often feel miserable and blue? Basically, the Personal Data Sheet was a paper-and-pencil version of a psychiatric interview, although it also contained some interesting empirical aspects: for example, items endorsed by 25% or more of a normal sample were omitted, and only symptoms that occurred twice as frequently in a neurotic group as in a normal group were included. Recruits who endorsed many symptoms could then be further assessed in a psychiatric interview. The Personal Data Sheet subsequently became the prototype for many personality inventories dealing with mental health and adjustment. Judged by modern standards, the Personal Data Sheet was unsophisticated, lacked norms, and did not address issues of reliability and validity; but it was successful, fulfilled a need, and provided a good role model; it was the forerunner of some 100 such inventories (Kleinmuntz, 1967).
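These two selection rules are an early instance of what later came to be called empirical keying. A minimal sketch of the logic in Python (the function and the endorsement rates below are hypothetical illustrations, not Woodworth's actual data):

# Hypothetical sketch of the Personal Data Sheet's item-selection rules:
# drop any item endorsed by 25% or more of the normal sample, and keep
# only items endorsed at least twice as often in the neurotic sample.
def select_items(normal_rates, neurotic_rates,
                 max_normal=0.25, min_ratio=2.0):
    keep = []
    for item, p_normal in normal_rates.items():
        p_neurotic = neurotic_rates.get(item, 0.0)
        if p_normal < max_normal and p_neurotic >= min_ratio * p_normal:
            keep.append(item)
    return keep

# Made-up endorsement proportions for three candidate items:
normal = {"tired most of the time": 0.10,
          "often miserable and blue": 0.30,
          "easy to make laugh": 0.05}
neurotic = {"tired most of the time": 0.40,
            "often miserable and blue": 0.55,
            "easy to make laugh": 0.06}
print(select_items(normal, neurotic))  # ['tired most of the time']

Here the second item fails the normal-endorsement cutoff and the third fails the frequency-ratio test, so only the first survives; the same two-criterion logic, much elaborated, underlies later empirically keyed inventories.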
The postwar period. Although the practical contribution of these tests to the war effort was minor (Samelson, 1977), the tremendous success of the mass application of psychological tests in World War I resulted in an upsurge of interest in testing (Pastore, 1978). After World War I the use of psychological tests rapidly became an integral part of our culture. The Army Alpha and Beta were released for general use and were widely used in schools and some industrial settings. Many new tests were devised, including many tests to serve special needs such as the testing of the deaf or the measurement of personality characteristics. Testing, which prior to World War I had been mainly confined to educational settings, now found applications in other areas, and the role of the psychologist expanded to include the assessment of human potential and abilities.

A safe world. The first World War was the war to end all wars. The world was then considered a safe place, and anthropologists and other social scientists took off for "exotic" parts of the world like Samoa and Egypt. In 1926, Florence Goodenough published her "draw-a-man" test, an attempt to develop a method of intelligence testing that would reflect a child's intellectual maturity from the drawings of a man (Goodenough, 1926). Because the test required only a piece of paper and a pencil and could be administered in pantomime, it quickly became a favorite cross-cultural tool. In the late 1930s and 1940s, this test was expanded to the measurement of personality and became a popular tool of clinical and child psychologists.

The Rorschach. In 1921, Hermann Rorschach (1884–1922), a Swiss psychiatrist, published a series of ten inkblots to be used as a diagnostic technique in the study of psychiatric patients. Rorschach was not the first to investigate the use of inkblots as a tool to study personality. Binet, for example, had already written on this topic. American psychologists such as Dearborn (1898), Sharp (1899), and Kirkpatrick (1900) had also published studies dealing with inkblots. But Rorschach's contribution was a systematic and in-depth approach, so that the Rorschach Inkblot became rapidly popular in the United States; several scoring systems were developed despite the fact that, in general, academic psychology criticized the Rorschach "cult" for its lack of scientific discipline. As we saw, today the Rorschach is still one of the most widely used tests and the focal point of much research and controversy.
The Stanford Achievement Test. In 1923, Terman and others published the Stanford Achievement Test, a battery of standardized tests to measure achievement in elementary-school subjects. The content of this test battery reflected a careful sampling of the actual material being taught in schools across the United States. As in the Stanford-Binet, the items had been carefully tested to determine their validity, and extensive norms had been gathered. In the field of educational testing, it was the Stanford Achievement Test that made large-scale testing a practical reality.

The Downey Will Temperament Test. In 1923, the Downey Will Temperament Test appeared, a test requiring the subject to write under various conditions, for example as fast as possible, or with eyes closed. This test attempted to measure various personality aspects such as flexibility, interest in detail, and reaction to contradiction, but most research studies that used this test obtained negative results. The Downey test was an interesting but unsuccessful reflection of the behavioristic demand for "objectivity." Several tests of aptitude in the areas of music, clerical skills, and mechanical ability also appeared in the early 1920s and subsequently found wide application in industry and civil service.

Louis L. Thurstone (1887–1955). Largely as the result of the work of L. L. Thurstone at the University of Chicago, the global approach to intelligence reflected in the Stanford-Binet and the Army Alpha and Beta was complemented by a "primary mental abilities" approach that attempted to develop tests to measure distinct and basic mental abilities. Thus Spearman's approach, popular in Britain, was replaced in the United States by the multiple-factor approach (although the pendulum seems to be swinging back to primacy of the g factor; see A. R. Jensen, 1998). Thurstone also was a pioneer in the measurement of attitudes, and his techniques for the construction of attitude scales are still widely used (see Chapter 6). Thurstone was an engineer and the inventor of a movie projector that eliminated the flicker of projectors then in use by having a continuously moving film. He demonstrated this to Thomas Edison, who was impressed and offered him a position in Edison's New Jersey laboratory. Soon, Thurstone became more interested in educational than engineering problems and returned to the academic world. He made many basic contributions to topics such as factor analysis, and despite the fact that many of his papers deal with highly complex and esoteric statistical issues, his primary concern was with practical problems (see Thurstone, 1952, for his autobiography).

Arnold Gesell (1880–1961). In 1925, Arnold Gesell, a pupil of G. Stanley Hall, began publishing a series of developmental schedules for infants and preschool children. These schedules were essentially an inventory of spontaneous and elicited behaviors that occurred in children's motor, adaptive, language, and personal-social behavior. Given such an inventory, the time of occurrence of these behaviors in a specific child could be compared with the "typical" time of occurrence. The work of Gesell is well known and represents a pioneering effort in the field of infant testing. Subsequent tests for young children have often used the Gesell Schedules as one of their main sources of material.

The Berkeley Growth Study. In 1928, Nancy Bayley, a psychologist at the University of California (Berkeley), began the Berkeley Growth Study (her doctoral dissertation) – a longitudinal study of a sample of 61 newborns, devoted to the investigation of mental and motor development and physical growth and maturing. Each newborn was studied intensively and was seen monthly for the first 15 months, then at 3-month intervals until age 3, and less frequently after that. Mental and motor test scores, anthropometric measures, observations of mother and child, interview data, and projective test data were collected. Reports from this study have been numerous and have contributed much to our understanding of psychological development (Bayley, 1968; 1986; 1991).

Other developments. Situational tests in the study of personality were also developed. These tests involved the close observation of a subject engaged in a task whose purpose was often disguised. A classical example is the series of investigations conducted by Hartshorne and May
(1928) concerned with cheating, stealing, and other behaviors in school children.

All of these developments reflected a rapid shift in psychology from a pre-Wundtian philosophical orientation to a scientific one grounded in measurement. This shift was paralleled in the field of education. Many bureaus of educational research were established, and their findings were rapidly disseminated. Surveys of instructional efficiency were made, and tests of intelligence and achievement became a routine part of school life. Various professional educational societies were established, and articles in scientific journals dealing with educational measurement became more common. Textbooks in the area of educational measurement began to appear in 1916 and 1917 and were quickly followed by others. In 1926, the Scholastic Aptitude Test was introduced as part of a college-admissions program.

In 1935, C. D. Morgan and H. A. Murray published the Thematic Apperception Test, a series of semistructured pictures to which the subject responds by telling a story (see Chapter 15). The TAT rapidly joined the Rorschach in terms of research studies and the number of controversies it generated (see W. G. Klopfer, 1973, for a brief history of projective techniques, and H. A. Murray, 1967, for an autobiography; see also M. B. Smith, 1990; M. B. Smith & Anderson, 1989). Also in 1935, Oscar K. Buros published a 44-page bibliography of tests. In 1936, the list was expanded to 83 pages. In 1938, this list became the First Mental Measurements Yearbook, some 400 pages long and covering about 4,000 tests, the first volume of a series that contains not only a compendium of commercially available tests, but also test information such as price, number of forms available, etc., and critical reviews.

In 1938, Lauretta Bender presented the Visual Motor Gestalt test, commonly known as the Bender-Gestalt, which consists of nine simple designs that the subject is instructed to copy. The rationale of this test was that perception is a total integrative process, and the subject's reproductions would reflect maturational level, personality organization, and pathological states. As discussed in Chapter 15, the Bender-Gestalt has found wide use both as an indicator of possible brain damage and as a reflection of psychodynamic aspects.

In a 1939 paper, L. K. Frank introduced the label "projective techniques" to refer to materials, such as inkblots, which the subject can respond to in such a way that the responses are taken to reflect needs, motives, past experiences, etc. – i.e., the individual presumably "projects" his or her personality into the perceptual organization. As discussed in Chapter 15, projective techniques became quite popular, not only among clinical psychologists who used these techniques to assess clients, but also among anthropologists and other behavioral scientists, who saw in these techniques a method to quantify observations across cultures. By way of contrast, the use of tests in the Soviet Union was banned by a 1936 decree (Brozek, 1972).

Nature vs. nurture. To be sure, the growth of testing was not without its pains. A large number of tests eagerly placed on the market did not live up to the users' expectations and often resulted in a negative attitude toward all testing. Many individuals with little or no psychological training used complex instruments to make decisions that often were not valid. For example, many school teachers were taught administration of the Binet scale during one summer course and would then become official community examiners, frequently being labeled as psychologists. In 1914, surveys published by J. E. Wallin indicated that the majority of psychological examiners were teachers, principals, and even medical personnel, most with very little training in psychology.

Acrimonious controversies over technical issues also did not help. A classical example is the Stanford-Iowa controversy over the constancy of the IQ – i.e., the question of whether intelligence is a fixed personal attribute based on hereditary givens, or a characteristic that not only reflects one's culture and upbringing but can also be altered by environmental manipulations such as coaching, nursery experiences, intensive teaching, and others (see McNemar, 1940; R. L. Thorndike, 1940; Wellman, Skeels, & Skodak, 1940). Again, this was not a new question, but one that went back to the time of Galton (Fancher, 1983). In 1932, Beth Wellman of Iowa published the first of a series of studies reporting marked changes in the IQs of children attending the Iowa University Elementary School. The evidence to support this contention consisted of mean IQ
increases from 110 to 119 and 124 on subsequent testing of a large group of children. Subsequent reports by Wellman and her co-workers presented additional evidence that the stimulation in a child's environment is an important factor in mental development (see Minton, 1984). Criticisms of the Iowa studies were swift and often violent. B. R. Simpson published an article in 1939, entitled "The wandering IQ: Is it time to settle down?" in which he bitterly denounced the work of Wellman as "worse than nonsense." Florence Goodenough, also in 1939, published "A critique of recent experiments on raising the IQ," a thoughtful evaluation based on the more rational arguments of unreliability, examiner bias, and statistical regression.

David Wechsler. In 1939, David Wechsler (1896–1981) pointed out that available adult tests of intelligence had typically been developed from children's tests, and therefore their content was often inappropriate for adults. He also indicated that available tests overemphasized speed and verbal abilities, and that their standardization rarely included adults. To correct these and other weaknesses, Wechsler created the Wechsler-Bellevue Intelligence Scale, an individual point-scale for adults. The Wechsler-Bellevue scale won wide acceptance and was soon ranked second in frequency of use, following the Stanford-Binet. In particular, the Wechsler-Bellevue found great use in military hospitals during World War II. Ten years later, Wechsler published the Wechsler Intelligence Scale for Children, and in 1955 the Wechsler-Bellevue was replaced by the Wechsler Adult Intelligence Scale (see Chapter 5).

World War II. World War II also made extensive demands on the skills and ingenuity of psychologists and further stimulated psychological testing. The successful placement of personnel into specialized war activities, such as radar observers, airplane navigators, and radio operators, became a crucial goal that generated much systematic and sophisticated research. Problems of adjustment, morale, and psychopathology also stimulated interest in testing, particularly in the use of projective techniques as tools for clinical psychologists. The high rate of draftee difficulties emphasized the severity of the mental-health problem and made greater demands upon clinical psychologists. These demands were further intensified by the fact that, at the end of the war, more than 50% of the patients in Veterans Administration Hospitals were neuropsychiatric patients. Clinical psychologists began to perform services such as psychotherapy, which previously had been restricted to medical personnel. Testing played a greater part in the clinical psychologists' activities, although often a wide gulf developed between the use of a particular test and experimental proof of its validity.

The Army Alpha of the First World War was replaced by the Army General Classification Test, which contained vocabulary, arithmetic, reasoning, and block-counting items (Harrell, 1992). Noteworthy also was the extensive use of situational tests made by the Office of Strategic Services (the forerunner of the Central Intelligence Agency) in their program designed to select spies, saboteurs, and other military intelligence personnel (OSS Assessment Staff, 1948).

The MMPI. In 1940, Starke Hathaway, a psychologist, and J. C. McKinley, a psychiatrist, at the University of Minnesota presented the Minnesota Multiphasic Personality Inventory, a collection of 550 items to aid in the diagnosis of clinical patients. The MMPI was instrumental in creating and fostering the profession of clinical psychology because it was seen as evidence that psychologists could not only diagnose but also provide therapy, functions which up till then were the province of psychiatry. The MMPI resulted in a veritable deluge of research publications, many of them aimed at developing new MMPI scales, or at applying standard scales to a wide variety of constructs. This intensity of research, as well as the applicability of the MMPI to a wide range of problems, made the MMPI the best-known personality inventory.

Postwar period. At the end of World War II, the federal government actively supported the training of clinical psychologists. This resulted both in great interest in psychological tests designed to measure the "inner" aspects of personality, and in a reaction against the identification of clinical psychology as applied psychometrics. The young clinicians felt they were not "just testers" and often disavowed any connections with tests
in their attempts at creating a professional "doctor" image. On the other hand, some psychologists produced evidence that tests could be a valuable avenue to the understanding of personality dynamics. Of note here is the work of David Rapaport and his collaborators (notably Merton Gill and Roy Schafer) at the Menninger Foundation. In their 1945 publication Diagnostic Psychological Testing, they presented a unified clinical diagnostic approach, demonstrating that psychological tests could contribute to our understanding of personality functioning. The federal government also supported vocational and educational training, with the result that the profession of guidance and counseling made heavy demands on the development and use of vocational tests and tests of aptitude, interests, personality, and intelligence.

SOME RECENT DEVELOPMENTS

The development and application of psychological tests have increased enormously; tests are routinely used in education, the military, civil service, industry, even religious life. The number of tests available commercially is truly gigantic, but is probably surpassed by the number of noncommercial tests. Large-scale testing programs involving thousands of subjects are relatively common. Complex statistical analyses, not possible before computers, have now become routine matters. It is somewhat difficult to single out recent contributions as historically noteworthy because the objectivity of time has not judged their worth, but the following are probably illustrative.

The Authoritarian Personality. One of the major issues of concern to social scientists following World War II was to try to understand how and why the Nazi movement had taken place, and whether such a movement might be possible in the United States. One answer was a book published in 1950 called The Authoritarian Personality. This psychoanalytically oriented investigation identified allegiance to authoritarianism as the culprit; this complex personality syndrome resulted from rigid and punitive child-rearing practices (Baars & Scheepers, 1993). The book also presented the F (fascist) scale, a self-inventory designed to assess authoritarianism. The F scale became one of the best known and most heatedly debated personality scales, with much of the focus on methodological issues.

The three faces of intellect. J. P. Guilford and his coworkers at the University of Southern California developed a unified theory of the human intellect that organizes various intellectual abilities into a single system called the structure of intellect (Guilford, 1959a; 1967b). As discussed in Chapters 5 and 8, Guilford's model is a three-dimensional one that classifies human intellectual abilities according to the type of mental operations or processes involved, the kind of content, and the type of outcome or product. Given five categories of operations, four kinds of content, and six kinds of products, the theory can be represented by a cube subdivided into 120 (5 × 4 × 6) smaller cubes. Based on this model and a factor-analytic approach, Guilford and his coworkers developed a large number of tests, each designed to be a pure measure of a particular intellectual ability. Under the operation dimension of divergent thinking, a number of measures have been developed and used in the area of creativity, despite their limited validity (see Guilford, 1967a, for his autobiography).

The 1960s. The launching of the Soviet Sputnik in October 1957 brought to the consciousness of the American public the possibility that the United States was no longer number one, at least in some areas of endeavor. As always, the high-school educational system was criticized, and a flurry of legislative actions resulted in a search for talented students who could once again bring the United States to the forefront of science and technology. These searches involved large-scale testing, exemplified by the National Merit Scholarship Corporation program. Such programs were facilitated by various technological advances that permitted machine scoring of a large number of answer sheets. This in turn facilitated the use of multiple-choice items, as opposed to essay questions, which could not be scored by machine. The result was dual. On the one hand, teams of experts devoted their considerable talents to the development of multiple-choice items that would assess more than memory and recognition. On the other hand, a number of critics protested rather vocally against these tests, particularly
the ones used to make admission decisions in higher education. One of the most vocal critics was Banesh Hoffman who, in a 1962 book titled The Tyranny of Testing, argued that multiple-choice tests were not only superficial but penalized the brighter and more creative students who, rather than select the simplistically correct choice defined by the test constructor, would select incorrect options because they were able to perceive more complex relationships between question and answer.

The 1960s also saw a major controversy over the use of personality tests in the selection of personnel, especially in the use of the MMPI for assessment of Peace Corps volunteers. This led to a Congressional investigation of psychological testing, a special issue of the American Psychologist (November, 1965) devoted to this topic, and a proliferation of books strongly critical of psychological testing [e.g., Hillel Black (1962) They Shall Not Pass; Martin Gross (1962) The Brain Watchers]. While much of the criticism generated more smoke than heat and was accompanied more by irrational conclusions than empirical evidence, the antitest movement, like most movements, had some benefits. For one, psychologists became more aware of the issue of privacy and of the need to communicate to and instruct the lay public as to the nature and limitations of psychological testing. For another, it made psychologists more critical of their tests and therefore more demanding of evidence to support the validity of such tests.

The 1970s: The IQ controversy. For many years, it had been observed and documented that minority children, especially black children, did not do as well in school as their Anglo counterparts. In an effort to redress this, the Federal Government had begun a massive program, the Head Start program, designed to provide minority children remedial and enrichment opportunities to give them a compensatory head start in the scholastic race. Although the program was received enthusiastically, a number of critics suggested that such interventions were of little value (Zigler & Muenchow, 1992).

In 1969, Arthur Jensen, a psychologist at the University of California at Berkeley, published an article in the Harvard Educational Review titled "How much can we boost IQ and scholastic achievement?" In this article, Jensen very carefully reviewed the literature and concluded that differences in intelligence and scholastic achievement among white children were primarily attributable to genetic aspects. He then concluded that the differences in intelligence between black and white children were also the result of genetic differences between the two races. The conclusion was the product of a scholarly mind and was based on a careful review of some rather complex issues, but it was certainly an unpopular conclusion that resulted in a barrage of criticisms, not only of Jensen's article, but also of psychological testing. Once again, it became clear that the enterprise of psychological testing is intertwined with societal, political, and philosophical values, and that the use of tests transcends the application of a scientific instrument. Psychological testing, especially intelligence testing, has been closely entwined in the United States with racism (A. J. Edwards, 1971; S. J. Gould, 1981; Urban, 1989). The entire controversy associated with intelligence testing is still unresolved and still very much with us. In 1994, Richard Herrnstein and Charles Murray published The Bell Curve, a book subtitled "Intelligence and class structure in American life," and once again reignited a century-old controversy (Zenderland, 1997).

The 1970s: Educational testing. Another major development of the 1970s concerned three aspects of educational testing (Haney, 1981; Resnick, 1982). The first concerned a decline from 1963 to 1977 of average SAT verbal scores by nearly 50 points, and of SAT math scores by about 30 points. A variety of causes for the decline were discussed, with the result that once again testing was brought to the public's consciousness.

A second development was minimum competency testing, the specification of standards to be achieved in order to be promoted from one grade to another and to graduate from high school (B. Lerner, 1981). Although the idea seemed simple and was popular with the American public, implementation was not as easy, and it carried a number of legal ramifications. For example, in Florida such a testing program was declared unconstitutional because 20% of black but only 2% of white high-school seniors would have failed such standards, despite the fact that the test was judged to have adequate content and construct validity (Haney, 1981).
The third development was the truth-in-testing legislation. Just as there are a variety of laws that protect the consumer when a product is purchased or used, so a number of legislative efforts were made, particularly in relation to admissions tests for professional and graduate programs, to allow the test taker to have access to his or her test results, to have publicly available information on the psychometric aspects of the test, and to give full disclosure to the test taker prior to testing as to what use will be made of the test scores. The basic rationale for such legislation was to permit the test taker the opportunity to know explicitly the criteria by which decisions are made about that individual; public disclosure would presumably make the test publisher and test administrator more accountable and, hence, more careful. The intent was good, but the results, it may be argued, punished test takers by resulting in higher fees and fewer administration dates (for a fascinating chronology of noteworthy events in American psychology, many relevant to testing, see Street, 1994).

The 1980s also saw a number of major attacks on both specific tests and psychological testing more generally, with the publication of books such as The Reign of ETS: The Corporation that Makes Up Minds (Nairn & Associates, 1980), The Testing Trap (Strenio, 1981), and None of the Above: Behind the Myth of Scholastic Aptitude (D. Owen, 1985).

A peek at the future. Following the example of Matarazzo (1992), I can hazard ten guesses about testing in the 21st century:

1. Current tests that are popular and have survived the rigors of scientific critique will still be with us. These should include tests such as the Stanford-Binet, the Wechsler series, and yes, even the MMPI.

2. Testing will be marked by a broader sensitivity to the individual aspects of the client. For example, versions of the tests will be available in a wide variety of languages, such as Vietnamese and Croatian.

3. There will be much greater use of physiological measures of intelligence, using a variety of biological indices such as reaction time (shades of Wundt!), velocity of nerve conduction, and the rate at which glucose is metabolized in the brain.

4. Studies from cognitive psychology, especially of how information from each test item is processed, will provide us with much information on cognitive abilities, new theoretical models to guide our thinking, as well as new types of test items and test forms.

5. Neuropsychological tests, along the lines of the Halstead-Reitan, will be much more sophisticated and will provide complex analyses of the processes that underlie various cognitive functions.

6. There will be even wider use of personality-type tests to assess various phenomena discussed in Chapter 8 (normal positive functioning) and Chapter 15 (health psychology), such as Type A behavior, hassles in everyday living, and optimism toward one's own health.

7. There will be more tests designed to assess positive functioning and abilities such as personal competence and quality of life, rather than the past emphasis on deficits.

8. Computers and technology will play a major role in testing, from administration to interpretation.

9. New statistical techniques, and in particular the Item Response Theory model, will result in drastically new approaches. For example, one of the commandments of psychometric testing has always been that all subjects taking test X must respond in the same way – if test X is a 100-item multiple-choice vocabulary test, we do not let John answer some items and Tom answer different items. In the future this will change (it already has, through adaptive testing), and not only will we be able to compare John's performance with Tom's, even though based on different subsets of items, but we will also allow each to answer whatever items they wish, and use their choice in useful predictive ways.

10. Testing has been dominated by multiple-choice items. Such items can be and are very useful and have been unjustly maligned. However, their dominance in the future will recede as other forms of tests such as portfolios, essay exams, and virtual reality situations are developed and their utility explored.

A final word. Perhaps the obvious needs to be stated. Psychologists themselves have been the most critical of all about psychological testing;
if you don’t believe me simply browse through Personality. Journal of the History of the Behavioral
the test reviews found in the Mental Measure- Sciences, 29, 345–353.
ments Yearbook! A reflection of their criticality A fascinating review by two Dutch social scientists of the back-
was the development of the Standards for edu- ground, theoretical ideas, and methodological contributions
that resulted in the classic study of the “Authoritarian Person-
cational and psychological tests (see Chapter 1), a
ality.”
code of recommended practices and ethical prin-
ciples involved in testing. Another reflection is Buchanan, R. D. (1994). The development of the Min-
the steady stream of articles published in profes- nesota Multiphasic Personality Inventory. Journal of
sional journals that are critical of either specific the History of the Behavioral Sciences, 30, 148–161.
tests, aspects of testing, or the whole totality of A review of how the MMPI came to be. Very readable and an
interesting look at a chapter in the history of psychology that
testing. in many ways revolutionized psychological testing.
In the 1960s and 1970s, a number of graduate
programs in psychology began to deemphasize Dennis, P. M. (1984). The Edison Questionnaire. Jour-
testing in the training of future psychologists, and nal of the History of the Behavioral Sciences, 20, 23–37.
for some, such deemphasis reached crisis propor- Thomas A. Edison, the famous inventor, developed a ques-
tionnaire of 48 questions to be administered to applicants
tions and hindered the advancement of psychol- for positions as industrial chemists in his laboratories. The
ogy as a science (Aiken et al., 1990; N. M. Lam- questions included: “What countries bound France?” “What
bert, 1991; Meier, 1993). The pendulum seems to is copra?” and “Who was Plutarch?” Although now unknown,
be swinging in the other direction; Time maga- the questionnaire actually had quite an impact on the public’s
perception of psychological testing.
zine, for example, reported (July 15, 1991) that
in the United States, 46 million students from Landy, F. J. (1992). Hugo Munsterberg: Victim or
kindergarten through high school are subjected visionary? Journal of Applied Psychology, 77, 787–802.
to more than 150 million standardized tests each Munsterberg was an early leader in the applied study of psy-
year. For better or for worse, psychological test- chology and was, indeed, one of the pioneers in the develop-
ment of psychology in the United States. Yet in many ways he
ing is a part of our everyday life, and a num- was, and is, an obscure figure.
ber of recent advances, from the development of
more sophisticated psychometric models to the Matarazzo, J. D. (1992). Psychological testing and
ubiquity of microcomputers, will probably make assessment in the 21st century. American Psychologist,
47, 1007–1018.
testing an even more important endeavor. With
education and a firm understanding of both the The author, a very distinguished psychologist, looks into the
crystal ball and gives us a glimpse of the future of testing.
potential and the limitations of tests, we can use
tests judiciously as part of an effort to create a
better world. I hope this textbook has provided
the beginnings of such a foundation. DISCUSSION QUESTIONS

1. From a historical perspective which one per-


SUMMARY son had the most impact on the development of
We have taken a quick look at the history of psy- psychological testing as a field?
chological testing and a very brief peek at its 2. If you could invite one person from those men-
potential future. Testing has a rather long past, tioned in the chapter to visit your class, which one
perhaps because it seems to be an intrinsic human would it be?
activity. But testing today seems to have evolved 3. If you are familiar with the musical “My Fair
primarily due to the French clinical tradition and Lady” (or the book Pygmalion on which it is
the work of Binet, the German nomothetic tra- based) you might want to discuss the major
dition as represented by the work of Wundt, the themes as they applied to England at the time
British idiographic approach, especially that of of Sir Francis Galton.
Sir Francis Galton, and the American emphasis 4. Could you take the work of Galton and
on pragmatism. of Goddard and argue for an environmental
explanation?
SUGGESTED READINGS 5. Are there any recent events that in the future
Baars, J., & Scheepers, P. (1993). Theoretical and may be incorporated in a chapter on the history
methodological foundations of the Authoritarian of psychological testing?
APPENDIX

Table to Translate Difficulty Level of a Test Item into a z Score

            Difficulty level      Equivalent         Difficulty level      Equivalent
            (area to the right)   z score            (area to the right)   z score
            99%                   −2.33              50%                     .00
            98                    −2.05              49                     +.03
very        97                    −1.88              48                     +.05
easy        96                    −1.75              47                     +.08
items       95                    −1.65              46                     +.10
            94                    −1.56              45                     +.13
            93                    −1.48              44                     +.15
            92                    −1.41              43                     +.18
            91                    −1.34              42                     +.20
            90                    −1.28              41                     +.23
            89                    −1.23              40                     +.25
            88                    −1.17              39                     +.28
            87                    −1.13              38                     +.31
            86                    −1.08              37                     +.33
            85                    −1.04              36                     +.36
            84                    −.99               35                     +.39
            83                    −.95               34                     +.41
            82                    −.92               33                     +.44
            81                    −.88               32                     +.47
            80                    −.84               31                     +.50
            79                    −.81               30                     +.52
            78                    −.77               29                     +.55
            77                    −.74               28                     +.58
            76                    −.71               27                     +.61
            75                    −.67               26                     +.64
            74                    −.64               25                     +.67
            73                    −.61               24                     +.71
            72                    −.58               23                     +.74
            71                    −.55               22                     +.77
            70                    −.52               21                     +.81
            69                    −.50               20                     +.84
            68                    −.47               19                     +.88
            67                    −.44               18                     +.92
            66                    −.41               17                     +.95
            65                    −.39               16                     +.99
            64                    −.36               15                     +1.04
            63                    −.33               14                     +1.08
            62                    −.31               13                     +1.13
            61                    −.28               12                     +1.17
            60                    −.25               11                     +1.23
            59                    −.23               10                     +1.28
            58                    −.20    very        9                     +1.34
            57                    −.18    difficult   8                     +1.41
            56                    −.15    items       7                     +1.48
            55                    −.13                6                     +1.56
            54                    −.10                5                     +1.65
            53                    −.08                4                     +1.75
            52                    −.05                3                     +1.88
            51                    −.02                2                     +2.05
                                                      1                     +2.33

Note: Because of rounding error and lack of interpolation, the values are approximate, and may not match normal curve parameters (e.g., the value for 84% is given as .99 rather than the expected 1).
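The tabled values are simply the inverse of the normal curve: an item's difficulty level is the proportion of examinees answering the item correctly, that is, the area to the right of z, and the equivalent z score is the point that leaves exactly that area to its right. For readers who prefer to compute exact values rather than interpolate from the table, the following minimal sketch in Python reproduces the tabled entries using the standard library's NormalDist class (the function name difficulty_to_z is simply illustrative):

    from statistics import NormalDist

    def difficulty_to_z(p_passing):
        """Return the z score that leaves an area of p_passing
        (the proportion of examinees answering the item correctly)
        to its right under the normal curve."""
        return NormalDist().inv_cdf(1.0 - p_passing)

    # Spot checks against the table above:
    print(round(difficulty_to_z(0.99), 2))  # -2.33 (a very easy item)
    print(round(difficulty_to_z(0.50), 2))  # 0.0   (an item of average difficulty)
    print(round(difficulty_to_z(0.16), 2))  # 0.99  (a difficult item)

Computed this way, a difficulty level of 84% yields −0.9945, which rounds to the tabled −.99 rather than the familiar rule-of-thumb value of −1; this is the rounding discrepancy referred to in the note above.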
References

Aaronson, N. K., Acquadro, C., Alonso, J., et al. (1992). International Quality of Life Assessment (IQOLA) project. Quality of Life Research, 1, 349–351.
Abdel-Khalek, A. (1988). Egyptian results on the Standard Progressive Matrices. Personality and Individual Differences, 9, 193–195.
Abramson, P. R., & Mosher, D. L. (1975). Development of a measure of negative attitudes toward masturbation. Journal of Consulting and Clinical Psychology, 43, 485–490.
Achenbach, T. M. (1991). Manual for the Child Behavior Checklist and 1991 profile. Burlington, VT: University of Vermont Department of Psychiatry.
Achenbach, T. M., & Edelbrock, C. S. (1981). Behavioral problems and competencies reported by parents of normal and disturbed children aged 4 through 16. Monographs of the Society for Research in Child Development, 46 (1, Serial No. 188).
Ackerman, P. L. (1992). Predicting individual differences in complex skill acquisition: Dynamics of ability determinants. Journal of Applied Psychology, 77, 598–614.
Adams, K. M., & Heaton, R. K. (1985). Automated interpretation of neuropsychological test data. Journal of Consulting and Clinical Psychology, 53, 790–802.
Adams, K. M., & Heaton, R. K. (1987). Computerized neuropsychological assessment: Issues and applications. In J. N. Butcher (Ed.), Computerized psychological assessment (pp. 355–365). New York: Basic Books.
Adams, K. M., Kvale, V. I., & Keegan, J. F. (1984). Relative accuracy of three automated systems for neuropsychological interpretation. Journal of Clinical Neuropsychology, 6, 413–431.
Adler, R., MacRitchie, K., & Engel, G. L. (1971). Psychologic processes and ischemic stroke. Psychosomatic Medicine, 33, 1–29.
Adorno, T. W., Frenkel-Brunswik, E., Levinson, D. J., & Sanford, R. N. (1950). The Authoritarian Personality. New York: Harper & Row.
Aiken, L. R. (1980). Problems in testing the elderly. Educational Gerontology, 5, 119–124.
Aiken, L. R. (1987). Assessment of intellectual functioning. Boston, MA: Allyn & Bacon.
Aiken, L. S., West, S. G., Sechrest, L., Reno, R. R., Roediger, H. L. III, Scarr, S., Kazdin, A. E., & Sherman, S. J. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of PhD programs in North America. American Psychologist, 45, 721–734.
Airasian, P. W. (1989). Review of the California Achievement Tests. In J. C. Conoley & J. J. Kramer (Eds.), The tenth mental measurements yearbook (pp. 126–128). Lincoln, NE: University of Nebraska Press.
Ajzen, I., & Fishbein, M. (1973). Attitudinal and normative variables as predictors of specific behaviors. Journal of Personality and Social Psychology, 27, 41–57.
Ajzen, I., & Fishbein, M. (1980). Understanding attitudes and predicting social behavior. Englewood Cliffs, NJ: Prentice Hall.
Akutagawa, D. A. (1956). A study in construct validity of the psychoanalytic concept of latent anxiety and a test of projection distance hypothesis. Unpublished doctoral dissertation, University of Pittsburgh.
Albaum, G., & Baker, K. (1977). Cross-validation of a creativity scale for the Adjective Check List. Educational and Psychological Measurement, 37, 1057–1061.
Albert, S., Fox, H. M., & Kahn, M. W. (1980). Faking psychosis on the Rorschach: Can expert judges detect malingering? Journal of Personality Assessment, 44, 115–119.
Albright, L. E., & Glennon, J. R. (1961). Personal history correlates of physical scientists' career aspirations. Journal of Applied Psychology, 45, 281–284.
Albright, L. E., Glennon, J. R., & Smith, W. J. (1963). The use of psychological tests in industry. Cleveland: Howard Allen Inc.
Alderman, A. L., & Powers, P. E. (1980). The effects of special preparation on SAT verbal scores. American Educational Research Journal, 17, 239–251.
Alderton, D. L., & Larson, G. E. (1990). Dimensionality of Raven's Advanced Progressive Matrices items. Educational and Psychological Measurement, 50, 887–900.
Allred, K. D., & Smith, T. W. (1989). The hardy personality: Cognitive and physiological responses to evaluative threat. Journal of Personality and Social Psychology, 56, 257–266.
Allen, J. P., & Litten, R. Z. (1993). Psychometric and laboratory measures to assist in the treatment of alcoholism. Clinical Psychology Review, 13, 223–239.
Allen, J. P., & Mattson, M. E. (1993). Psychometric instruments to assist in alcoholism treatment planning. Journal of Substance Abuse Treatment, 10, 289–296.
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.
Allen, R. M., & Jefferson, T. W. (1962). Psychological evaluation of the cerebral palsied person. Springfield, IL: Charles C Thomas.
Alliger, G. M., & Dwight, S. S. (2000). A meta-analytic investigation of the susceptibility of integrity tests to faking and coaching. Educational and Psychological Measurement, 60, 59–72.
Allport, G. W. (1937). Personality: A psychological interpretation. New York: Holt, Rinehart and Winston.
Allport, G. W. (1961). Pattern and growth in personality. New York: Holt, Rinehart and Winston.
Allport, G. W., & Odbert, H. S. (1936). Trait-names: A psycholexical study. Psychological Monographs, 47 (1, Whole No. 211).
Allport, G. W., Vernon, P. E., & Lindzey, G. (1960). Study of values (3rd ed.). Boston: Houghton-Mifflin.
Altemeyer, B. (1981). Right-wing authoritarianism. Manitoba, Canada: University of Manitoba Press.
Altepeter, T. S., & Johnson, K. A. (1989). Use of the PPVT-R for intellectual screening with adults: A caution. Journal of Psychoeducational Assessment, 7, 39–45.
Alter, J. B. (1984). Creativity profile of university and conservatory dance students. Journal of Personality Assessment, 48, 153–158.
Alter, J. B. (1989). Creativity profile of university and conservatory music students. Creativity Research Journal, 2, 184–195.
Alvino, J., McDonnel, R. C., & Richert, S. (1981). National survey of identification practices in gifted and talented education. Exceptional Children, 48, 124–132.
American Association on Mental Deficiency. (1974). Adaptive Behavior Scale: Manual. Washington, DC: Author.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
American Guidance Service, Inc. (1969). The Minnesota Rate of Manipulation Tests, examiner's manual. Circle Pines, MN: Author.
American Psychological Association. (1986). Guidelines for computer-based tests and interpretations. Washington, DC: Author.
American Psychological Association. (1987). General guidelines for providers of psychological services. American Psychologist, 42, 712–723.
American Psychological Association. (1992). Ethical principles of psychologists and code of conduct. American Psychologist, 47, 1597–1611.
American Psychological Association, Task Force on the Delivery of Services to Ethnic Minority Populations. (1993). Guidelines for providers of psychological services to ethnic, linguistic, and culturally diverse populations. American Psychologist, 48, 45–48.
American Psychological Association Committee on Psychological Tests and Assessment. (1996). Statement on the disclosure of test data. American Psychologist, 51, 644–648.
Ames, L. B., Gillespie, B. S., Haines, J., & Ilg, F. L. (1979). The Gesell Institute's child from one to six: Evaluating the behavior of the preschool child. New York: Harper & Row.
Anastasi, A. (1983). Evolving trait concepts. American Psychologist, 38, 175–184.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Anastasi, A., & Schaefer, C. E. (1969). Biographical correlates of artistic and literary creativity in adolescent girls. Journal of Applied Psychology, 53, 267–273.
Anderson, G. J., & Walberg, H. J. (1976). The assessment of learning environments. Chicago, IL: University of Illinois.
Anderson, R. D., & Sisco, F. H. (1977). Standardization of the WISC-R for deaf children (Series T, No. 1). Washington, DC: Gallaudet College, Office of Demographic Studies.
Anderson, R. T., Aaronson, N. K., & Wilkin, D. (1993). Critical review of the international assessments of health-related quality of life. Quality of Life Research, 2, 369–395.
Andrews, J. (1973). The relationship of values to identity achievement status. Journal of Youth and Adolescence, 2, 133–138.
Andrews, L. W., & Gutkin, T. B. (1991). The effects of human vs. computer authorship on consumers' perceptions of psychological reports. Computers in Human Behavior, 7, 311–317.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (pp. 508–600). Washington, DC: American Council on Education.
Aplin, D. Y. (1993). Psychological evaluation of adults in a cochlear implant program. American Annals of the Deaf, 138, 415–419.
Arbisi, P. A., & Ben-Porath, Y. S. (1998). The ability of Minnesota Multiphasic Personality Inventory-2 validity scales to detect fake-bad responses in psychiatric inpatients. Psychological Assessment, 10, 221–228.
Archer, R. P., Gordon, R. A., & Kirchner, F. H. (1987). MMPI response-set characteristics among adolescents. Journal of Personality Assessment, 51, 506–516.
Arellano, C. M., & Markman, H. J. (1995). The Managing Affect and Differences Scale (MADS): A self-report measure assessing conflict management in couples. Journal of Family Psychology, 9, 319–334.
Argulewicz, E. N., Bingenheimer, L. T., & Anderson, C. C. (1983). Concurrent validity of the PPVT-R for Anglo-American and Mexican-American students. Journal of Psychoeducational Assessment, 1, 163–167.
Arizmendi, T., Paulsen, K., & Domino, G. (1981). The Matching Familiar Figures Test: A primary, secondary, and tertiary evaluation. Journal of Clinical Psychology, 37, 812–818.
Armstrong, R. G. (1955). A reliability study of a short form of the WISC vocabulary subtest. Journal of Clinical Psychology, 11, 413–414.
Arnold, M. B. (1962). Story sequence and analysis. New York: Columbia University Press.
Aronow, E., & Reznikoff, M. (1976). Rorschach content interpretation. New York: Grune & Stratton.
Aronow, E., Reznikoff, M., & Moreland, K. L. (1995). The Rorschach: Projective technique or psychometric test? Journal of Personality Assessment, 64, 213–228.
Arrasmith, D. G., Sheehan, D. S., & Applebaum, W. R. (1984). A comparison of the selected-response strategy and the constructed-response strategy for assessment of a third-grade writing task. Journal of Educational Research, 77, 172–177.
Arthur, D. (1994). Workplace testing. New York: American Management Association.
Arthur, W., Jr., & Day, D. V. (1994). Development of a short form for the Raven Advanced Progressive Matrices Test. Educational and Psychological Measurement, 54, 394–403.
Arthur, W., Jr., & Woehr, D. J. (1993). A confirmatory factor analytic study examining the dimensionality of the Raven's Advanced Progressive Matrices. Educational and Psychological Measurement, 53, 471–478.
Asher, S. R., Singleton, L. C., Tinsley, B. R., & Hymel, S. (1979). A reliable sociometric measure for preschool children. Developmental Psychology, 15, 443–444.
Association of American Medical Colleges (1977). New Medical College Admission Test interpretive manual. Washington, DC: Association of American Medical Colleges.
Astin, A. W. (1982). Minorities in American higher education. San Francisco: Jossey-Bass.
Astin, A. W., & Holland, J. L. (1961). The environmental assessment technique: A way to measure college environments. Journal of Educational Psychology, 52, 308–316.
Atkin, R. S., & Conlon, E. J. (1978). Behaviorally anchored rating scales: Some theoretical issues. Academy of Management Review, 3, 119–128.
Atkinson, D., & Gim, R. (1989). Asian-American cultural identity and attitudes toward mental health services. Journal of Counseling Psychology, 36, 209–212.
Atkinson, L. (1992). Concurrent validities of the Stanford-Binet (4th ed.), Leiter, and Vineland with developmentally delayed children. Journal of School Psychology, 30, 165–173.
Austin, J. S. (1992). The detection of fake good and fake bad on the MMPI-2. Educational and Psychological Measurement, 52, 669–674.
Averill, L. A. (1990). Recollections of Clark's G. Stanley Hall. Journal of the History of the Behavioral Sciences, 26, 125–130.
Axelrod, S., & Eisdorfer, C. (1961). Attitudes toward old people: An empirical analysis of the stimulus-group validity of the Tuckman-Lorge Questionnaire. Journal of Gerontology, 16, 75–80.
Aylward, E. H. (1991). Understanding children's testing. Austin, TX: Pro-ed.
Aylward, E. H., & Schmidt, S. (1986). An examination of three tests of visual-motor integration. Journal of Learning Disabilities, 19, 328–335.
Baars, J., & Scheepers, P. (1993). Theoretical and methodological foundations of the authoritarian personality. Journal of the History of the Behavioral Sciences, 29, 345–353.
Back, K. W. (1971). Metaphors as test of personal philosophy of aging. Sociological Focus, 5, 1–8.
Bachman, J. G., & O'Malley, P. M. (1984). Yea-saying, nay-saying, and going to extremes: Black-white differences in response styles. Public Opinion Quarterly, 48, 491–509.
Baehr, M. E. (1953). A simplified procedure for the measurement of employee attitudes. Journal of Applied Psychology, 37, 163–167.
Baehr, M. E., Saunders, D. R., Froemel, E. C., & Furcon, J. E. (1971). The prediction of performance for black and for white police patrolmen. Professional Psychology, 2, 46–58.
Bagarozzi, D. (1991). The Family Apperception Test: A review. American Journal of Family Therapy, 19, 177–181.
Bagby, R. M., Gillis, J. R., & Dickens, S. (1990). Detection of dissimulation with the new generation of objective personality measures. Behavioral Sciences and the Law, 8, 93–102.
Bagley, C., Wilson, G. D., & Boshier, R. (1970). The Conservatism scale: A factor structure comparison of English, Dutch, and New Zealand samples. The Journal of Social Psychology, 81, 267–268.
Bagnato, S. J. (1984). Team congruence in developmental diagnosis and intervention: Comparing clinical judgment and child performance measures. School Psychology Review, 13, 7–16.
Bagnato, S. J., & Neisworth, J. T. (1991). Assessment for early intervention: Best practices for professionals. New York: Guilford Press.
Bahr, H. M., & Chadwick, B. A. (1974). Conservatism, racial intolerance, and attitudes toward racial assimilation among whites and American Indians. Journal of Social Psychology, 94, 45–56.
Bailey, K. D. (1988). The conceptualization of validity: Current perspectives. Social Science Research, 17, 117–136.
Bailey-Richardson, B. (1988). Review of the Vineland Adaptive Behavior Scales, Classroom ed. Journal of Psychoeducational Assessment, 6, 87–91.
Baillargeon, J., & Danis, C. (1984). Barnum meets the computer: A critical test. Journal of Personality Assessment, 48, 415–419.
Baird, L. L. (1985). Do grades and tests predict adult accomplishment? Research in Higher Education, 23, 3–85.
Baird, L. L. (1987). Do students think admissions tests are fair? Do tests affect their decisions? Research in Higher Education, 26, 373–388.
Baird, P. (1983). Assessing life events and physical health: Critique of the Holmes and Rahe scale. Psychology, 20, 38–40.
Baker, A. F. (1983). Psychological assessment of autistic children. Clinical Psychology Review, 3, 41–59.
Baker, E. L., O'Neill, H. F., Jr., & Linn, R. L. (1993). Policy and validity prospects for performance-based assessment. American Psychologist, 48, 1210–1218.
Baker, H. J., & Leland, B. (1959). Detroit Tests of Learning Aptitude. Indianapolis, IN: Bobbs-Merrill.
Baker, R. L., Mednick, B. R., & Hocevar, D. (1991). Utility of scales derived from teacher judgements of adolescent academic performance and psychosocial behavior. Educational and Psychological Measurement, 51, 271–286.
Baldessarini, R. J., Finkelstein, S., & Arana, G. W. (1983). The predictive power of diagnostic tests and the effects of prevalence of illness. Archives of General Psychiatry, 40, 569–573.
Baldwin, A. L., Cole, R. E., & Baldwin, C. (1982). Parental pathology, family interaction, and the competence of the child in school. Monographs of the Society for Research in Child Development, 47 (No. 5).
Ball, S., & Bogatz, G. A. (1970). The first year of Sesame Street: An evaluation. Princeton, NJ: Educational Testing Service.
Ballard, R., Crino, M. D., & Rubenfeld, S. (1988). Social desirability response bias and the Marlowe-Crowne Social Desirability Scale. Psychological Reports, 63, 227–237.
Balzer, W. K., & Sulsky, L. M. (1992). Halo and performance appraisal research: A critical examination. Journal of Applied Psychology, 77, 975–985.
Banks, M. H., Jackson, P. R., Stafford, E. M., & Warr, P. B. (1983). The Job Components Inventory and the analysis of jobs requiring limited skill. Personnel Psychology, 36, 57–66.
Bannatyne, A. D. (1971). Language, reading, and learning disabilities. Springfield, IL: Charles C Thomas.
Banta, T. J. (1961). Social attitudes and response styles. Educational and Psychological Measurement, 21, 543–557.
Barba, R. H., & Mason, C. L. (1992). Construct validity studies for the Draw-A-Computer User Test. Computers in Human Behavior, 8, 231–237.
Barker, H. R., Fowler, R. D., & Peterson, L. P. (1971). Factor analytic structure of the short form MMPI in a VA hospital population. Journal of Clinical Psychology, 27, 228–233.
Barkley, R. A. (1988). A review of Child Behavior Rating Scales and checklists for research in child psychopathology. In M. Rutter, H. Tuma, & I. Lann (Eds.), Assessment and diagnosis in child and adolescent psychopathology. New York: Guilford Press.
Barnett, A. J. (1982). Designing an assessment of the child with cerebral palsy. Psychology in the Schools, 19, 160–165.
Barnette, W. L., Jr., & McCall, J. N. (1964). Validation of the Minnesota Vocational Interest Inventory for vocational high school boys. Journal of Applied Psychology, 48, 378–382.
Baron, J. (1985). Rationality and intelligence. Cambridge, UK: Cambridge University Press.
Baron, J., & Norman, M. F. (1992). SATs, Achievement Tests, and high-school class rank as predictors of college performance. Educational and Psychological Measurement, 52, 1047–1055.
Barrett, E. T., & Gleser, G. C. (1987). Development and validation of the cognitive status examination. Journal of Consulting and Clinical Psychology, 55, 877–882.
Barrett, E. T., Wheatley, R. D., & LaPlant, R. J. (1983). A brief clinical neuropsychological battery: Clinical classification trials. Journal of Clinical Psychology, 39, 980–984.
Barrett, G. V., Alexander, R. A., Doverspike, D., Cellar, D., & Thomas, J. C. (1982). The development and application of a computerized information-processing test battery. Applied Psychological Measurement, 6, 13–29.
Barrick, M. R., & Mount, M. K. (1991). The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
Barringer, D. G., Strong, C. J., Blair, J. C., Clark, T. C., & Watkins, S. (1993). Screening procedures used to identify children with hearing loss. American Annals of the Deaf, 138, 420–426.
Barron, F. (1953). An Ego-Strength Scale which predicts response to psychotherapy. Journal of Consulting Psychology, 17, 327–333.
Barron, F. (1969). Creative person and creative process. New York: Holt, Rinehart and Winston.
Barry, B., Klanderman, J., & Stipe, D. (1983). Study number one. In A. S. Kaufman & N. L. Kaufman (Eds.), Kaufman Assessment Battery for Children: Interpretive manual (p. 94). Circle Pines, MN: American Guidance Service.
Bar-Tal, D., & Bar-Zohar, Y. (1977). The relationship between perception of locus of control and academic achievement: Review and some educational implications. Contemporary Educational Psychology, 2, 181–199.
Barton, P. E., & Coley, R. J. (1994). Testing in America's schools. Princeton, NJ: Educational Testing Service.
Bass, B. M. (1956). Development and evaluation of a scale for measuring social acquiescence. The Journal of Abnormal and Social Psychology, 53, 296–299.
Baucom, D. H. (1985). Review of the California Psychological Inventory. In J. V. Mitchell, Jr. (Ed.), The ninth mental measurements yearbook (Vol. 1, pp. 250–252). Lincoln, NE: University of Nebraska Press.
Baucom, D. H., Epstein, N., Rankin, L. A., & Burnett, C. K. (1996). Assessing relationship standards: The Inventory of Specific Relationship Standards. Journal of Family Psychology, 10, 72–88.
Bauer, M. S., Crits-Christoph, P., Ball, W. A., Dewees, E., McAllister, T., Alahi, P., et al. (1991). Independent assessment of manic and depressive symptoms by self-rating: Scale characteristics and implications for the study of mania. Archives of General Psychiatry, 48, 807–812.
Baughman, E. E., & Dahlstrom, W. G. (1972). Racial differences on the MMPI. In S. S. Guterman (Ed.), Black psyche. Berkeley, CA: Glendessary Press.
Baum, D. D., & Kelly, T. J. (1979). The validity of the Slosson Intelligence Test with learning disabled kindergarteners. Journal of Learning Disabilities, 12, 268–270.
Bauman, M. K., & Kropf, C. A. (1979). Psychological tests used with blind and visually handicapped persons. School Psychology Digest, 8, 257–270.
Bayley, N. (1968). Behavioral correlates of mental growth: Birth to thirty-six years. American Psychologist, 23, 1–17.
Bayley, N. (1969). Bayley Scales of Infant Development. New York: Psychological Corporation.
Bayley, N. (1986). Behavioral correlates of mental growth: Birth to thirty-six years. Advances in Infancy Research, 4, 16–37.
Bayley, N. (1991). Consistency and variability in the growth of intelligence from birth to eighteen years. Journal of Genetic Psychology, 152, 573–604.
Beaber, R., Marston, A., Michelli, J., & Mills, M. (1985). A brief test for measuring malingering in schizophrenic individuals. American Journal of Psychiatry, 142, 1478–1481.
Beaver, A. P. (1953). Personality factors in choice of nursing. Journal of Applied Psychology, 37, 374–379.
Bechtoldt, H. P. (1959). Construct validity: A critique. American Psychologist, 14, 619–629.
Bech, P., Gram, L. F., Dein, E., Jacobsen, O., Vitger, J., & Bolwig, T. G. (1975). Quantitative rating of depressive states: Correlations between clinical assessment, Beck's Self Rating Scale and Hamilton's Objective Rating Scale. Acta Psychiatrica Scandinavica, 51, 161–170.
Beck, A. T. (1967). Depression. New York: Harper & Row.
Beck, A. T. (1978). Beck inventory. Philadelphia, PA: Center for Cognitive Therapy.
Beck, A. T., & Beamesderfer, A. (1974). Assessment of depression: The depression inventory. In P. Pichot (Ed.), Psychological measurement in pharmacopsychiatry (Vol. 7). Basel: S. Karger.
Beck, A. T., Rial, W. Y., & Rickels, K. (1974). Short form of depression inventory: Cross validation. Psychological Reports, 34, 1184–1186.
Beck, A. T., & Steer, R. A. (1984). Internal consistencies of the original and revised Beck Depression Inventory. Journal of Clinical Psychology, 40, 1365–1367.
Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Beck Depression Inventory manual (2nd ed.). San Antonio, TX: Psychological Corporation.
Beck, A. T., Steer, R. A., & Garbin, M. G. (1988). Psychometric properties of the Beck Depression
Inventory: Twenty-five years of evaluation. Clinical Psychology Review, 8, 77–100.
Beck, A. T., Ward, C. H., Mendelson, M., Mock, J., & Erbaugh, J. (1961). An inventory for measuring depression. Archives of General Psychiatry, 4, 561–571.
Beck, M. D., & Beck, C. K. (1980). Multitrait-multimethod validation of four personality measures with a high-school sample. Educational and Psychological Measurement, 40, 1005–1011.
Becker, J. T., & Sabatino, D. A. (1973). Frostig revisited. Journal of Learning Disabilities, 6, 180–184.
Becker, R. L. (1988). Reading-Free Vocational Interest Inventory-Revised. Columbus, OH: Elbern.
Becker, T. E., & Colquitt, A. L. (1992). Potential vs. actual faking of a biodata form: An analysis along several dimensions of item type. Personnel Psychology, 45, 389–406.
Beery, K. (1989). The VMI (Developmental Test of Visual-Motor Integration) administration, scoring, and teaching manual. Cleveland, OH: Modern Curriculum Press.
Beery, K. K., & Buktenica, N. A. (1967). Developmental Test of Visual-Motor Integration. Chicago, IL: Follett Educational Corporation.
Behar, L. B. (1977). The Preschool Behavior Questionnaire. Journal of Abnormal Child Psychology, 5, 265–276.
Bejar, I. I., Embretson, S., & Mayer, R. E. (1987). Cognitive psychology and the SAT: A review of some implications. ETS Research Report, #87–28.
Bekker, L. D., & Taylor, C. (1966). Attitudes toward the aged in a multigenerational sample. Journal of Gerontology, 21, 115–118.
Bell, D. C., & Bell, L. G. (1989). Micro and macro measurement of family systems concepts. Journal of Family Psychology, 3, 137–157.
Bellack, A. S., & Hersen, M. (1977). Self-report inventories in behavioral assessment. In J. D. Cone & R. P. Hawkins (Eds.), Behavioral assessment (pp. 52–76). New York: Brunner/Mazel.
Bellack, A. S., Hersen, M., & Lamparski, D. (1979). Role play tests for social skills: Are they valid? Are they useful? Journal of Consulting and Clinical Psychology, 47, 335–342.
Bellak, L. (1975). The T.A.T., C.A.T. and S.A.T. in clinical use (3rd ed.). New York: Grune & Stratton.
Bellak, L. (1986). The Thematic Apperception Test, the Children's Apperception Test, and the Senior Apperception Technique in clinical use (4th ed.). Orlando, FL: Grune & Stratton.
Bender, L. (1938). A visual motor gestalt test and its clinical use. American Orthopsychiatric Association, Research Monographs, No. 3.
Bendig, A. W. (1959). Score reliability of dichotomous and trichotomous item responses on the Maudsley Personality Inventory. Journal of Consulting Psychology, 23, 181.
Benjamin, L. S. (1993). Every psychopathology is a gift of love. Psychotherapy Research, 3, 1–24.
Benjamin, L. T. (1984). Staying with initial answers on objective tests: Is it a myth? Teaching of Psychology, 11, 133–141.
Bennett, R. E., & Ward, W. C. (1993). Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment. Hillsdale, NJ: Erlbaum.
Bennett, V. D. C. (1964). Development of a self concept Q sort for use with elementary age school children. Journal of School Psychology, 3, 19–25.
Benson, J., & Hocevar, D. (1985). The impact of item phrasing on the validity of attitude scales for elementary school children. Journal of Educational Measurement, 22, 213–240.
Bentler, P. (1972). Review of the Tennessee Self Concept Scale. In O. K. Buros (Ed.), The seventh mental measurements yearbook (pp. 366–367). Highland Park, NJ: Gryphon Press.
Bentler, P. M., & LaVoie, A. L. (1972). An extension of semantic space. Journal of Verbal Learning and Verbal Behavior, 11, 174–182.
Bentler, P. M., & Speckart, G. (1979). Attitude organization and the attitude-behavior relationship. Journal of Personality and Social Psychology, 37, 913–929.
Benton, A. L. (1974). The Visual Retention Test. New York: Psychological Corporation.
Benton, A. L., Hamsher, K. de S., Varney, N. R., & Spreen, O. (1983). Contributions to neuropsychological assessment: A clinical manual. New York: Oxford University Press.
Berdie, R. F. (1950). Scores on the SVIB and the Kuder Preference Record in relation to self-ratings. Journal of Applied Psychology, 34, 42–49.
Berg, I. A. (1955). Response bias and personality: The deviation hypothesis. Journal of Psychology, 40, 60–71.
Berg, I. A. (1959). The unimportance of test item content. In B. M. Bass & I. A. Berg (Eds.), Objective approaches to personality assessment (pp. 83–99). Princeton, NJ: D. Van Nostrand.
Bergan, J. R. (1977). Behavioral consultation. Columbus, OH: Charles E. Merrill.
Bergan, J. R., & Kratochwill, T. R. (1990). Behavioral consultation in applied settings. New York: Plenum Press.
Bergland, B. (1974). Career planning: The use of sequential evaluated experience. In E. L. Herr (Ed.), Vocational guidance and human development. Boston, MA: Houghton Mifflin.
Bergner, M., Bobbitt, R. A., Carter, W. B., & Gilson, B. S. (1981). The Sickness Impact Profile: Development and final revision of a health status measure. Medical Care, 19, 787–805.
Bergner, M., Bobbitt, R. A., Kressel, S., et al. (1976). The Sickness Impact Profile: Conceptual foundation and methodology for the development of a health status measure. International Journal of Health Services, 6, 393–415.
Bergner, M., & Rothman, M. L. (1987). Health status measures: An overview and guide for selection. Annual Review of Public Health, 8, 191–210.
Berk, R. A. (1982). Handbook of methods for detecting test bias. Baltimore, MD: Johns Hopkins University Press.
Berk, R. A. (Ed.) (1984). A guide to criterion-referenced test construction. Baltimore, MD: Johns Hopkins University Press.
Berk, R. A. (1986). A consumer's guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56, 137–172.
Berlinsky, S. (1952). Measurement of the intelligence and personality of the deaf: A review of the literature. Journal of Speech and Hearing Disorders, 17, 39–54.
Berman, A., & Hays, T. (1973). Relation between belief in after-life and locus of control. Journal of Consulting and Clinical Psychology, 41, 318.
Bernard, L. C., Houston, W., & Natoli, L. (1993). Malingering on neuropsychological memory tests: Potential objective indicators. Journal of Clinical Psychology, 49, 45–53.
Berry, D. T. R., Baer, R. A., & Harris, M. J. (1991). Detection of malingering on the MMPI: A meta-analysis. Clinical Psychology Review, 11, 585–598.
Berry, J. W. (1974). Radical cultural relativism and the concept of intelligence. In J. W. Berry & P. R. Dasen (Eds.), Culture and cognition: Readings in cross-cultural psychology (pp. 225–229). London: Methuen.
Berry, K., & Sherrets, S. (1975). A comparison of the WISC and WISC-R for special education. Pediatric Psychology, 3, 14.
Bersoff, D. N. (1970). The revised deterioration formula for the Wechsler Adult Intelligence Scale: A test of validity. Journal of Clinical Psychology, 26, 71–73.
Beutler, L. E., Arizmendi, T. G., Crago, M., Shanfield, S., & Hagaman, R. (1983). The effects of value similarity and clients' persuadability on value convergence and psychotherapy improvement. Journal of Social and Clinical Psychology, 1, 231–245.
Beutler, L. E., Crago, M., & Arizmendi, T. G. (1986). Research on therapist variables in psychotherapy. In S. L. Garfield & A. E. Bergin (Eds.), Handbook of psychotherapy and behavior change (pp. 257–310). New York: Wiley.
Bigler, E. D. (1986). Forensic issues in neuropsychology. In D. Wedding, A. M. Horton, Jr., & J. Webster (Eds.), The neuropsychology handbook (pp. 526–547). Berlin: Springer-Verlag.
Bigler, E. D., & Nussbaum, N. L. (1989). Child neuropsychology in the private medical practice. In C. R. Reynolds & E. Fletcher-Janzen (Eds.), Handbook of clinical child neuropsychology (pp. 557–572). New York: Plenum Press.
Binder, L. M. (1992). Deception and malingering. In A. E. Puente & R. J. McCaffrey (Eds.), Handbook of neuropsychological assessment (pp. 353–374). New York: Plenum Press.
Binder, L. M. (1993). Assignment of malingering after mild head trauma with the Portland Digit Recognition Test. Journal of Clinical and Experimental Neuropsychology, 15, 170–182.
Binder, L. M., & Willis, S. C. (1991). Assessment of motivation after financially compensable minor head trauma. Psychological Assessment, 3, 175–181.
Binet, A., & Simon, T. (1905). Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux. Année Psychologique, 11, 191–244.
Birren, J. E., & Birren, B. A. (1990). The concepts, models, and history of the psychology of aging. In J. E. Birren & K. W. Schaie (Eds.), Handbook of the psychology of aging (3rd ed., pp. 3–20). New York: Academic Press.
Bishop, D., & Butterworth, G. E. (1979). A longitudinal study using the WPPSI and WISC-R with an English sample. British Journal of Educational Psychology, 49, 156–168.
Blanchard, R., Klassen, P., Dickey, R., Kuban, M. E., & Blak, T. (2001). Sensitivity and specificity of the phallometric test for pedophilia in nonadmitting sex offenders. Psychological Assessment, 13, 118–126.
Blankstein, K. R., Flett, G. L., & Koledin, S. (1991). The Brief College Student Hassles Scale: Development, validation, and relation with pessimism. Journal of College Student Development, 32, 258–264.
Blau, T. H. (1991). The psychological examination of the child. New York: Wiley.
Blazer, D., Hughes, D. C., & George, L. K. (1987). The epidemiology of depression in an elderly community population. The Gerontologist, 27, 281–287.
Block, J. (1957). A comparison between ipsative and normative ratings of personality. Journal of Abnormal and Social Psychology, 54, 50–54.
Block, J. (1961). The Q-sort method in personality assessment and psychiatric research. Springfield, IL: Charles C Thomas.
Block, J. (1978). The Q-sort method. Palo Alto, CA: Consulting Psychologists Press.
Block, J. (1981). Some enduring and consequential structures of personality. In A. I. Rabin (Ed.), Further explorations in personality. New York: Wiley-Interscience.
Block, J., Weiss, D. S., & Thorne, A. (1979). How relevant is a semantic similarity interpretation of ratings? Journal of Personality and Social Psychology, 37, 1055–1074.
Bloom, A. S., & Raskin, L. M. (1980). WISC-R Verbal-Performance IQ discrepancies: A comparison of learning disabled children to the normative sample. Journal of Clinical Psychology, 36, 322–323.
Bloom, A. S., Reese, A., Altshuler, L., Meckler, C. L., & Raskin, L. M. (1983). IQ discrepancies between the Binet and WISC-R in children with developmental problems. Journal of Clinical Psychology, 39, 600–603.
Bloom, B. (Ed.) (1956). Taxonomy of educational objectives. Handbook 1: Cognitive domain. New York: Longmans, Green & Co.
Bloom, L., & Lahey, M. (1978). Language development and language disorders. New York: Wiley.
Blum, G. S. (1950). The Blacky Pictures. Cleveland, OH: The Psychological Corporation.
Blum, J. E., Fosshage, J. L., & Jarvik, L. F. (1972). Intellectual changes and sex differences in octogenarians: A twenty-year longitudinal study of aging. Developmental Psychology, 7, 178–187.
Blumenthal, M. (1975). Measuring depressive symptomatology in a general population. Archives of General Psychiatry, 34, 971–978.
Bochner, S. (1978). Reliability of the Peabody Picture Vocabulary Test: A review of 32 selected research studies published between 1965 and 1974. Psychology in the Schools, 15, 320–327.
Boehm, A. E. (1971). Boehm Test of Basic Concepts. New York: The Psychological Corporation.
Boehm, V. R. (1968). Mr. Prejudice, Miss Sympathy, and the authoritarian personality: An application of psychological measuring techniques to the problem of jury bias. Wisconsin Law Review, 734–750.
Bogardus, E. S. (1925). Measuring social distance. Journal of Applied Sociology, 9, 299–308.
Boldt, R. F. (1986). Generalization of SAT validity across colleges. ETS Research Report, #86–24.
Bolt, M. (1978). The Rokeach Value Survey: Preferred or preferable? Perceptual and Motor Skills, 47, 322.
Bolton, B. (Ed.) (1990). Special education and rehabilitation testing: Practical applications and test reviews. Austin, TX: Pro-Ed, Applied Testing Series.
Bolton, B. (1995). Review of the Inwald Personality Inventory. In J. C. Conoley & J. C. Impara (Eds.), The twelfth mental measurements yearbook (pp. 501–503). Lincoln, NE: The University of Nebraska Press.
Bondy, M. (1974). Psychiatric antecedents of psychological testing (before Binet). Journal of the History of the Behavioral Sciences, 10, 180–194.
Booth-Kewley, S., Edwards, J. E., & Rosenfeld, P. (1992). Impression management, social desirability, and computer administration of attitude questionnaires: Does the computer make a difference? Journal of Applied Psychology, 77, 562–566.
Booth-Kewley, S., Rosenfeld, P., & Edwards, J. E. (1992). Impression management and self-deceptive enhancement among Hispanic and non-Hispanic white Navy recruits. Journal of Social Psychology, 132, 323–329.
Borgatta, E. F. (1964). The structure of personality characteristics. Behavioral Science, 9, 8–17.
Borgen, F. H. (1986). New approaches to the assessment of interests. In W. B. Walsh & S. H. Osipow (Eds.), Advances in vocational psychology (Vol. 1, pp. 83–125). Hillsdale, NJ: Erlbaum.
Borgen, F. H., & Harper, G. T. (1973). Predictive validity of measured vocational interests with black and white college men. Measurement and Evaluation in Guidance, 6, 19–27.
Boring, E. G. (1957). A history of experimental psychology. New York: Appleton-Century-Crofts.
Boring, E. G. (1965). On the subjectivity of important historical dates: Leipzig, 1879. Journal of the History of the Behavioral Sciences, 1, 5–9.
Bornstein, P. H., Bridgwater, C. A., Hickey, J. S., & Sweeney, T. M. (1980). Characteristics and trends in behavioral assessment: An archival analysis. Behavioral Assessment, 2, 125–133.
Bornstein, R. F. (1999). Criterion validity of objective and projective dependency tests: A meta-analytic assessment of behavioral prediction. Psychological Assessment, 11, 48–57.
Bornstein, R. F., Hill, E. L., Robinson, K. J., Calabrese, C., & Bowers, K. S. (1996). Internal reliability of Rorschach Oral Dependency Scale scores. Educational and Psychological Measurement, 56, 130–138.
Bortner, M. (1965). Review of the Progressive Matrices. In O. K. Buros (Ed.), The sixth mental measurements yearbook. Highland Park, NJ: Gryphon Press.
Borum, R., & Grisso, T. (1995). Psychological test use in criminal forensic evaluations. Professional Psychology: Research and Practice, 26, 465–473.
Bosman, F., Hoogenboom, J., & Walpot, G. (1994). An interactive video test for pharmaceutical chemist's assistants. Computers in Human Behavior, 10, 51–62.
Boswell, D. L., & Pickett, J. A. (1991). A study of the internal consistency and factor structure of
the Verbalizer-Visualizer Questionnaire. Journal of Mental Imagery, 15, 33–36.
Boudreau, R. A., Killip, S. M., MacInnis, S. H., Milloy, D. G., & Rogers, T. B. (1983). An evaluation of Graduate Record Examinations as predictors of graduate success in a Canadian context. Canadian Psychology, 24, 191–199.
Bouman, T. K., & Luteijn, F. (1986). Relations between the Pleasant Events Schedule, depression, and other aspects of psychopathology. Journal of Abnormal Psychology, 95, 373–377.
Bowman, M. L. (1989). Testing individual differences in ancient China. American Psychologist, 44, 576–578.
Boyd, M. E., & Ward, G. (1967). Validities of the D-48 test for use with college students. Educational and Psychological Measurement, 27, 1137–1138.
Boyle, G. J. (1989). Confirmation of the structural dimensionality of the Stanford-Binet Intelligence Scale (4th ed.). Personality and Individual Differences, 10, 709–715.
Bracken, B. A. (1981). McCarthy Scales as a learning disability diagnostic aid: A closer look. Journal of Learning Disabilities, 14, 128–130.
Bracken, B. A. (1984). Bracken Basic Concept Scale. San Antonio, TX: The Psychological Corporation.
Bracken, B. A. (1985). A critical review of the Kaufman Assessment Battery for Children (K-ABC). School Psychology Review, 14, 21–36.
Bracken, B. A. (1987). Limitations of preschool instruments and standards for minimal levels of technical adequacy. Journal of Psychoeducational Assessment, 5, 313–326.
Bracken, B. A. (1991). The assessment of preschool children with the McCarthy Scales of Children's Abilities. In B. A. Bracken (Ed.), The psychoeducational assessment of preschool children (2nd ed., pp. 53–85). Boston, MA: Allyn & Bacon.
Bracken, B. A., Prasse, D. P., & McCallum, R. S. (1984). Peabody Picture Vocabulary Test-Revised: An appraisal and review. School Psychology Review, 13, 49–60.
Bradburn, N. M., & Caplovitz, D. (1965). Reports on happiness. Chicago, IL: Aldine.
Braden, J. P. (1984). The factorial similarity of the WISC-R Performance Scale in deaf and hearing samples. Personality and Individual Differences, 4, 403–410.
Braden, J. P. (1985). The structure of nonverbal intelligence in deaf and hearing subjects. American Annals of the Deaf, 130, 496–501.
Braden, J. P. (1989). The criterion-related validity of the WISC-R Performance Scale and other nonverbal IQ tests for deaf children. American Annals of the Deaf, 134, 329–332.
Braden, J. P. (1990). Do deaf persons have a characteristic psychometric profile on the Wechsler Performance Scales? Journal of Psychoeducational Assessment, 8, 518–526.
Braden, J. P. (1992). The Differential Ability Scales and special education. Journal of Psychoeducational Assessment, 10, 92–98.
Bradley, R. H., & Caldwell, B. M. (1974). Issues and procedures in testing young children. ERIC TM Report #37. Princeton, NJ: Educational Testing Service.
Bradley, R. H., & Caldwell, B. M. (1977). Home observation for measurement of the environment: A validation study of screening efficiency. American Journal of Mental Deficiency, 81, 417–420.
Bradley, R. H., & Caldwell, B. M. (1979). Home observation for measurement of the environment: A revision of the Preschool Scale. American Journal of Mental Deficiency, 84, 235–244.
Bradley, R. H., Caldwell, B. M., & Elardo, R. (1977). Home environment, social status, and mental test performance. Journal of Educational Psychology, 69, 697–701.
Bragman, R. (1982). Effects of different methods of conveying test directions on deaf children's performance on pattern recognition tasks. Journal of Rehabilitation of the Deaf, 16, 17–26.
Bragman, R. (1982). Review of research on test instructions for deaf children. American Annals of the Deaf, 127, 337–346.
Braithwaite, V. A., & Law, H. G. (1985). Structure of human values: Testing the adequacy of the Rokeach Value Survey. Journal of Personality and Social Psychology, 49, 250–263.
Brannick, M. T., Michaels, C. E., & Baker, D. P. (1989). Construct validity of in-basket scores. Journal of Applied Psychology, 74, 957–963.
Brannigan, G. G., Aabye, S. M., Baker, L. A., & Ryan, G. T. (1995). Further validation of the qualitative scoring system for the modified Bender-Gestalt Test. Psychology in the Schools, 32, 24–26.
Brauer, B. A. (1993). Adequacy of a translation of the MMPI into American Sign Language for use with deaf individuals: Linguistic equivalency issues. Rehabilitation Psychology, 38, 247–260.
Braun, P. R., & Reynolds, D. N. (1969). A factor analysis of a 100-item fear survey inventory. Behaviour Research and Therapy, 7, 399–402.
Bray, D. W. (1982). The assessment center and the study of lives. American Psychologist, 37, 180–189.
Brazelton, T. B. (1973). Neonatal Behavioral Assessment Scale (Clinics in Developmental Medicine, No. 50). Philadelphia, PA: J. B. Lippincott.
Brazelton, T. B., Robey, J. S., & Collier, G. (1969). Infant development in the Zinacanteco Indians of Southern Mexico. Pediatrics, 44, 274–293.
Breckler, S. J. (1984). Empirical validation of affect, behavior, and cognition as distinct components of attitude. Journal of Personality and Social Psychology, 47, 1191–1205.
Breen, M. J., Carlson, M., & Lehman, J. (1985). The revised Developmental Test of Visual-Motor Integration: Its relation to the VMI, WISC-R, and Bender Gestalt for a group of elementary aged learning disabled students. Journal of Learning Disabilities, 18, 136–138.
Breland, H. M., & Ironson, G. H. (1976). DeFunis reconsidered: A comparative analysis of alternative admissions strategies. Journal of Educational Measurement, 13, 89–99.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: ACT Publications.
Bretz, R. D., Jr., Ash, R. A., & Dreher, G. F. (1989). Do people make the place? An examination of the attraction-selection-attrition hypothesis. Personnel Psychology, 42, 561–581.
Bridgeman, B., & Lewis, C. (1994). The relationship of essay and multiple-choice scores with grades in college courses. Journal of Educational Measurement, 31, 37–50.
Brigance, A. H. (1978). Brigance Diagnostic Inventory of Early Development. Billerica, MA: Curriculum Associates.
Briggs, S. R. (1992). Assessing the five-factor model of personality description. Journal of Personality, 60, 253–293.
Bringmann, W. G., Balance, W. D. G., & Evans, R. B. (1975). Wilhelm Wundt 1832–1920: A brief biographical sketch. Journal of the History of the Behavioral Sciences, 11, 287–297.
Brislin, R. W. (1970). Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1, 185–216.
Brodsky, S. (1989). Advocacy in the guise of scientific objectivity: An examination of Faust and Ziskin. Computers in Human Behavior, 5, 261–264.
Brody, N. (1985). The validity of tests of intelligence. In B. B. Wolman (Ed.), Handbook of intelligence (pp. 353–389). New York: Wiley.
Broen, W. E., Jr., & Wirt, R. D. (1958). Varieties of response sets. Journal of Consulting Psychology, 22,
ages six to ten referred for psychological evaluation. Psychology in the Schools, 14, 30–33.
Brown, A. L. (1978). Knowing when, where, and how to remember: A problem of metacognition. In R. Glaser (Ed.), Advances in instructional psychology (Vol. 1, pp. 77–165). Hillsdale, NJ: Erlbaum.
Brown, D. C. (1994). Subgroup norming: Legitimate testing practice or reverse discrimination? American Psychologist, 49, 927–928.
Brown, F. (1979). The SOMPA: A system of measuring potential abilities? School Psychology Digest, 8, 37–46.
Brown, F. G. (1976). Principles of educational and psychological testing (2nd ed.). New York: Holt, Rinehart and Winston.
Brown, H. S., & May, A. E. (1979). A test-retest reliability study of the Wechsler Adult Intelligence Scale. Journal of Consulting and Clinical Psychology, 47, 601–602.
Brown, I. S. (1984). Development of a scale to measure attitude toward the condom as a method of birth control. The Journal of Sex Research, 20, 255–263.
Brown, S. A., Goldman, M. S., Inn, A., & Anderson, L. R. (1980). Expectations of reinforcement from alcohol: Their domain and relation to drinking patterns. Journal of Consulting and Clinical Psychology, 48, 419–426.
Brown, S. H. (1978). Long-term validity of a personal history item scoring procedure. Journal of Applied Psychology, 63, 673–676.
Brown, S. R. (1980). Political subjectivity: Application of Q methodology in political science. New Haven, CT: Yale University Press.
Brown, T. L. (1991). Concurrent validity of the Stanford-Binet (4th ed.): Agreement with the WISC-R in classifying learning disabled children. Psychological Assessment, 3, 247–253.
Brozek, J. (1972). To test or not to test: Trends in the Soviet views. Journal of the History of the Behavioral Sciences, 8, 243–248.
Bruce, M. M., & Learner, D. B. (1958). A supervisory practices test. Personnel Psychology, 11, 207–216.
Bruch, M. A. (1977). Psychological Screening Inventory as a predictor of college student adjustment. Journal of Consulting and Clinical Psychology, 45, 237–244.
237–240. Bruck, M., Ceci, S. J., & Hembrooke, H. (1998). Reli-
Brooks, C. M., Jackson, J. R., Hoffman, H. H., & Hand, ability and credibility of young children’s reports.
G. S., Jr. (1981). Validity of the new MCAT for pre- American Psychologist, 53, 136–151.
dicting GPA and NBME Part 1 Examination Perfor- Bruhn, A. R., & Reed, M. R. (1975). Simulation of
mance. Journal of Medical Education, 56, 767–769. brain damage on the Bender-Gestalt Test by col-
Brooks, C. R. (1977). WISC, WISC-R, SB L & M, lege students. Journal of Personality Assessment, 3,
WRAT: Relationships and trends among children 244–255.
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

References 547

Bruininks, R. H. (1978). Bruininks-Oseretsky Test of Burke, H. R. (1972). Raven’s Progressive Matrices:


Motor Proficiency, examiner’s manual. Circle Pines, validity, reliability, and norms. The Journal of Psy-
MN: American Guidance Service. chology, 82, 253–257.
Bruvold, W. H. (1968). Scales for rating the taste of Burke, H. R., & Bingham, W. C. (1969). Raven’s Pro-
water. Journal of Applied Psychology, 52, 245–253. gressive Matrices: More on construct validity. The
Bruvold, W. H. (1975). Judgmental bias in the rating Journal of Psychology, 72, 247–251.
of attitude statements. Educational and Psychological Burke, M. J., & Normand, J. (1987). Computerized
Measurement, 35, 605–611. psychological testing: Overview and critique. Profes-
Buck, J. N. (1966). The House-Tree-Person technique: sional Psychology: Research and Practice, 18, 42–51.
Revised manual. Beverly Hills, CA: Western Psycho- Burke, M. J., Normand, J., & Raju, N. S. (1987).
logical Services. Examinee attitudes toward computer-administered
Buckhalt, J. A. (1986). Test review of the British Ability ability testing. Computers in Human Behavior, 3,
Scales. Journal of Psychoeducational Assessment, 4, 95–107.
325–332. Burkhart, B. R., Christian, W. L., & Gynther, M. D.
Buckhalt, J. A. (1990). Criterion-related validity of (1978). Item subtlety and the MMPI: A paradoxical
the British Ability Scales Short-Form for black relationship. Journal of Personality Assessment, 42,
and white children. Psychological Reports, 66, 1059– 76–80.
1066. Burnam, M. A., Telles, C. A., Karno, M., Hough, R. L., &
Buckhalt, J. A. (1991). Test review of the Wech- Escobar, J. I. (1987). Measurement of acculturation
sler Preschool and Primary Scale of Intelligence – in a community population of Mexican Americans.
Revised. Journal of Psychoeducational Assessment, 9, Hispanic Journal of Behavioral Sciences, 9, 105–130.
271–279. Buros, O. K. (Ed.). The sixth mental measurements
Buckhalt, J. A., Denes, G. E., & Stratton, S. P. (1989). yearbook (pp. 762–765). Highland Park, NJ: The
Validity of the British Ability Scales Short-Form for Gryphon Press.
a sample of U.S. students. School Psychology Inter- Buss, A. H. (1989). Personality as traits. American Psy-
national, 10, 185–191. chologist, 44, 1378–1388.
Budoff, M., & Friedman, M. (1964). “Learning poten- Buss, A. R. (1976). Galton and the birth of differential
tial” as an assessment approach to the adolescent psychology and eugenics: Social, political, and eco-
mentally retarded. Journal of Consulting Psychology, nomic forces. Journal of the History of the Behavioral
28, 433–439. Sciences, 12, 47–58.
Buechley, R., & Ball, H. (1952). A new test of “validity” Buss, D. M., & Scheier, M. F. (1976). Self-
for the group MMPI. Journal of Consulting Psychol- consciousness, self-awareness, and self-attribution.
ogy, 16, 299–301. Journal of Research in Personality, 10, 463–468.
Burden, R. L., & Fraser, B. J. (1993). Use of classroom Butcher, J. N. (1978). Automated MMPI interpretative
environment assessments in school psychology: A systems. In O. K. Buros (Ed.), Eighth mental mea-
British perspective. Psychology in the Schools, 30, surements yearbook (pp. 942–945). Highland Park,
232–240. NJ: Gryphon Press.
Burg, C., Quinn, P. O., & Rapoport, J. L. (1978). Clinical Butcher, J. N. (1979). Use of the MMPI in personnel
evaluation of one-year-old infants: Possible predic- selection. In J. N. Butcher (Ed.), New developments
tors of risk for the “hyperactivity syndrome.” Journal in the use of the MMPI. Minneapolis, MN: University
of Pediatric Psychology, 3, 164–167. of Minnesota Press.
Burgemeister, B. B., Blum, L., & Lorge, I. (1972). Butcher, J. N. (1990). MMPI-2 in psychological treat-
Columbia Mental Scale. New York: Harcourt, Brace, ment. New York: Oxford University Press.
Jovanovich. Butcher, J. N. (Ed.) (1996). International adaptations of
Burger, G. K. (1975). A short form of the California the MMPI-2. Minneapolis, MN: University of Min-
Psychological Inventory. Psychological Reports, 37, nesota Press.
179–182. Butcher, J. N., Dahlstrom, W. G., Graham, J. R., Telle-
Burgess, A. (1991). Profile analysis of the Wechsler gen, A., & Kaemmer, B. (1989). Manual for adminis-
Intelligence Scales: A new index of subtest scatter. tration and scoring: Minnesota Multiphasic Personal-
British Journal of Clinical Psychology, 30, 257–263. ity Inventory-2: (MMPI-2). Minneapolis: University
Burgess, E. W., & Cottrell, L. S. (1939). Predicting suc- of Minnesota Press.
cess or failure in marriage. New York: Prentice Hall. Butcher, J. N., Graham, J. R., Williams, C. L., & Ben-
Burisch, M. (1984). Approaches to personality inven- Porath, Y. S. (1990). Development and use of the
tory construction. American Psychologist, 39, 214– MMPI-2 content scales. Minneapolis, MN: Univer-
227. sity of Minnesota Press.
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

548 References

Butcher, J. N., Keller, L. S., & Bacon, S. F. (1985). Cur- Campbell, D. P. (1974). Manual for the Strong-
rent developments and future directions in comput- Campbell Interest Inventory. Stanford, CA: Stanford
erized personality assessment. Journal of Consulting University.
and Clinical Psychology, 53, 803–815. Campbell, D. P., & Hansen, J. C. (1981). Manual for
Butcher, J. N., & Owen, P. (1978). Objective personal- the SVIB-SCII (3rd ed.). Stanford, CA: Stanford
ity inventories: Recent research and some contem- University.
porary issues. In B. B. Wolman (Ed.), Clinical diag- Campbell, D. T. (1960). Recommendations for APA
nosis of mental disorders: A handbook (pp. 475–546). test standards regarding construct, trait, and dis-
New York: Plenum. criminant validity. American Psychologist, 15, 546–
Byrne, B. M., & Baron, P. (1994). Measuring adolescent 553.
depression: Tests of equivalent factorial structure for Campbell, D. T., & Fiske, D. W. (1959). Convergent
English and French versions of the Beck Depres- and discriminant validation by the multitrait-multi-
sion Inventory. Applied Psychology: An International method matrix. Psychological Bulletin, 56, 81–105.
Review, 43, 33–47. Campbell, J. M., Amerikaner, M., Swank, P., & Vincent,
Byrnes, J. P., & Takahira, S. (1993). Explaining gen- K. (1989). The relationship between the Hardiness
der differences on SAT-Math items. Developmental Test and the Personal Orientation Inventory. Journal
Psychology, 29, 805–810. of Research in Personality, 23, 373–380.
Bzoch, K. R., & League, R. (1971). Assessing language Campbell, J. P. (1990). An overview of the Army selec-
skills in infancy: A handbook for multidimensional tion and classification project (Project A). Personnel
analysis of emergent language. Gainesville, FL: Tree Psychology, 43, 231–239.
of Life Press. Canfield, A. A. (1951). The “sten” scale – A modified
Cacioppo, J. T., Petty, R. E., & Geen, T. R. (1989). Atti- C scale. Educational and Psychological Measurement,
tude structure and function: From the tripartite to 11, 295–297.
the homeostasis model of attitudes. In A. R. Pratka- Canivez, G. L., & Watkins, M. W. (1998), Long-
nis, S. J. Breckler, & A. G. Greenwald (Eds.), Attitude term stability of the Wechsler Intelligence Scale for
structure and function (pp. 275–309). Hillsdale, NJ: Children – Third Edition. Psychological Assessment,
Erlbaum. 10, 285–291.
Cain, L. F., Levine, S., & Elzey, F. F. (1977). Manual for Canter, A. H. (1952). MMPI profile in multiple scle-
the Cain-Levine Social Competency Scale. Palo Alto, rosis. Journal of Consulting Psychology, 15, 353–356.
CA: Consulting Psychologists Press. Canter, D. (1969). An intergroup comparison of con-
Cairns, R. B., & Green, J. A. (1979). How to assess notative dimensions in architecture. Environment
personality and social patterns: Observations or and Behavior, 1, 37–48.
ratings? In R. B. Cairns & J. A. Green (Eds.), Cardno, J. A. (1955). The notion of attitude: An his-
The analysis of social interactions. Hillsdale, NJ: torical note. Psychological Reports, 1, 345–352.
Erlbaum. Carey, W. B., & McDevitt, S. C. (1978). Stability and
Cairo, P. C. (1979). The validity of the Holland and change in individual temperament diagnoses from
Basic Interest Scales of the Strong Vocational Inter- infancy to early childhood. Journal of the American
est Blank: Leisure activities vs. occupational mem- Academy of Child Psychiatry, 17, 331–337.
bership as criteria. Journal of Vocational Behavior, Carline, J. D., Cullen, T. J., Scott, C. S., Shannon, N. F.,
15, 68–77. & Schaad, D. (1983). Predicting performance dur-
Caldwell, A. B. (2001). What do the MMPI scales fun- ing clinical years from the New Medical College
damentally measure? Some hypotheses. Journal of Admission Test. Journal of Medical Education, 58,
Personality Assessment, 76, 1–17. 18–25.
Camara, W. L., & Schneider, D. L. (1994). Integrity Carlson, C. I., & Grotevant, H. D. (1987). A compar-
tests: Facts and unresolved issues. American Psychol- ative review of family rating scales: Guidelines for
ogist, 49, 112–119. clinicians and researchers. Journal of Family Psychol-
Campbell, A., Converse, P. E., & Rodgers, W. L. (1976). ogy, 1, 23–47.
The quality of American life. New York: Russell Sage Carlson, J. G. (1985). Recent assessments of the Myers-
Foundation. Briggs Type Indicator. Journal of Personality Assess-
Campbell, D. P. (1966). The stability of vocational ment, 49, 356–365.
interests within occupations over long time spans. Carlson, J. H., & Williams, T. (1993). Perspectives on
Personnel and Guidance Journal, 44, 1012–1019. the seriousness of crimes. Social Science Research,
Campbell, D. P. (1971). Handbook for the Strong 22, 190–207
Vocational Interest Blank. Stanford, CA: Stanford Carlson, J. S., & Jensen, C. M. (1981). Reliability of the
University. Raven Colored Progressive Matrices Test: Age and
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

References 549

ethnic group comparisons. Journal of Consulting and for children (K-ABC). Journal of Psychoeducational
Clinical Psychology, 49, 320–322. Assessment, 8, 155–164.
Carlyn, M. (1977). An assessment of the Myers-Briggs Caruso, J. C. (2000). Reliability generalization of the
Type Indicator. Journal of Personality Assessment, 41, NWO Personality scales. Educational and psycholog-
461–473. ical measurement, 60, 236–254.
Carmines, E. G., & Zeller, R. A. (1979). Reliability and Carvajal, H. (1987a). Correlations between scores on
validity assessment. Beverly Hills, CA: Sage. Stanford-Binet IV and Wechsler Adult Intelligence
Carney, R. N., & Schattgen, S. F. (1994). California Scale-Revised. Psychological Reports, 61, 83–86.
Achievement Tests (5th ed.). In D. J. Keyser, & R. C. Carvajal, H. (1987b). 1986 Stanford-Binet abbreviated
Sweetland (Eds.), Test critiques (Vol. X, pp. 110– forms. Psychological Reports, 61, 285–286.
119). Austin, TX: Pro-ed. Carvajal, H. (1987c). Relationship between scores of
Carp, A. L., & Shavzin, A. R. (1950). The susceptibility young adults on Stanford-Binet IV and Peabody Pic-
to falsification of the Rorschach psychodiagnostic ture Vocabulary Test-Revised. Perceptual and Motor
technique. Journal of Consulting Psychology, 14, 230– Skills, 65, 721–722.
233. Carvajal, H. (1988). Relationships between scores on
Carpenter, P. A., Just, M. A., & Shell, P. (1990). Stanford-Binet IV and scores on McCarthy Scales
What one intelligence test measures: A theoreti- of Children’s Abilities. Bulletin of the Psychonomic
cal account of the processing in the Raven Pro- Society, 26, 349.
gressive Matrices Test. Psychological Review, 97, Carvajal, H. (1991). Relationships between scores
404–431. on Wechsler Preschool and Primary Scale of
Carr, A. C., Ghosh, A., & Ancil, R. J. (1983). Can a intelligence-Revised and Stanford-Binet IV. Psycho-
computer take a psychiatric history? Psychological logical Reports, 69, 23–26.
Medicine, 13, 151–158. Carver, R. P. (1974). Two dimensions of tests: Psycho-
Carr, A. C., Wilson, S. L., Ghosh, A., Ancill, R. J., & metric and edumetric. American Psychologist, 29,
Woods, R. T. (1982). Automated testing of geri- 512–518.
atric patients using a microcomputer-based system. Carver, R. P. (1992). Reliability and validity of the
International Journal of Man-Machine Studies, 17, Speed of Thinking test. Educational and Psychologi-
297–300. cal Measurement, 52, 125–134.
Carraher, S. M. (1993). Another look at the dimension- Cascio, W. F. (1975). Accuracy of verifiable biographi-
ality of a learning style questionnaire. Educational cal information blank responses. Journal of Applied
and Psychological Measurement, 53, 411–415. Psychology, 60, 767–769.
Carrier, M. R., Dalessio, A. T., & Brown, S. H. (1990). Cascio, W. F., Alexander, R. A., & Barrett, G. V. (1988).
Correspondence between estimates of content and Setting cutoff scores: Legal, psychometric, and pro-
criterion-related validity values. Personnel Psychol- fessional issues and guidelines. Personnel Psychology,
ogy, 43, 85–100. 41, 1–24.
Carroll, J. B. (1952). Ratings on traits measured by a Cascio, W. F., Outtz, J., Zedeck, S., & Golstein, I. L.
factored personality inventory. Journal of Abnormal (1991). Statistical implications of six methods of
and Social Psychology, 47, 626–632. test score use in personnel selection. Human Per-
Carroll, J. B. (1982). The measurement of intelligence. formance, 4, 233–264.
In R. J. Sternberg (Ed.), Handbook of human intel- Casey, T. A., Kingery, P. M., Bowden, R. G., & Corbett,
ligence (pp. 29–120). Cambridge, MA: Cambridge B. S. (1993). An investigation of the factor structure
University Press. of the Multidimensional Health Locus of Control
Carrow, E. (1973). Carrow Elicited Language Inventory. Scales in a health promotion program. Educational
Austin, TX: Learning Concepts. and Psychological Measurement, 53, 491–498.
Carson, K. P., & Gilliard, D. J. (1993). Construct valid- Cassisi, J. E., & Workman, D. E. (1992). The detection
ity of the Miner Sentence Completion Scale. Journal of malingering and deception with a short form of
of Occupational and Organizational Psychology, 66, the MMPI-2 based on the L, F, and K scales. Journal
171–175. of Clinical Psychology, 48, 54–58.
Carstensen, L. L., & Cone, J. D. (1983). Social desir- Castro, F. G., Furth, P., & Karlow, H. (1984). The health
ability and the measurement of psychological well- beliefs of Mexican, Mexican American, and Anglo
being in elderly persons. Journal of Gerontology, 38, American women. Hispanic Journal of Behavioral
713–715. Sciences, 6, 365–383.
Carter, B. D., Zelko, F. A. J., Oas. P. T., & Waltonen, Cates, J. A. (1991). Comparison of human figure draw-
S. (1990). A comparison of ADD/H children and ings by hearing and hearing-impaired children. The
clinical controls on the Kaufman Assessment Battery Volta Review. 93, 31–39.
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

550 References

Catron, D. W., & Thompson, C. C. (1979). Test-retest Chen, G. M. (1994). Social desirability as a predictor of
gains in WAIS scores after four retest intervals. Jour- argumentativeness and communication apprehen-
nal of Clinical Psychology, 35, 352–357. sion. The Journal of Psychology, 128, 433–438.
Cattell, R. B. (1943). The description of personality: Chernyshenko, O. S., & Ones, D. S. (1999). How selec-
I. Foundations of trait measurement. Psychological tive are psychology graduate programs? The effect
Review, 50, 559–594. of the selection ratio on GRE score validity. Educa-
Cattell, R. B. (1950). Personality: A systematic theoret- tional and Psychological Measurement, 59, 951–961.
ical and factual study. New York: McGraw-Hill. Cherrick, H. M. (1985). Review of Dental Admission
Cattell, R. B. (1963). Theory of fluid and crystallized Test. In J. V. Mitchell, Jr. (Ed.), The ninth mental mea-
intelligence: A critical experiment. Journal of Edu- surements yearbook (Vol. 1, pp. 448–449). Lincoln:
cational Psychology, 54, 1–22. University of Nebraska Press.
Cattell, R. B. (1986). Structured tests and functional Chinese Culture Connection. (1987). Chinese values
diagnoses. In R. B. Cattell & R. C. Johnson (Eds.), and the search for culture-free dimensions of cul-
Functional psychological testing. New York: Brunner- ture. Journal of Cross-Cultural Psychology, 18, 143–
Mazel. 164.
Cattell, R. B. (1987). Intelligence: Its structure, growth Christensen, H. T., & Carpenter, G. R. (1962). Value-
and action. New York: North Holland. behavior discrepancies regarding pre-marital coitus
Cattell, R. B., Cattell, A. K., & Cattell, H. E. (1993). in three Western cultures. American Sociological
Sixteen Factor Questionnaire (5th ed.). Champaign, Review, 27, 66–74.
IL: Institute for Personality and Ability Testing. Christiansen, N. D., Goffin, R. D., Johnston, N. G.,
Cattell, R. B., Eber, H. W., & Tatsuoka, M. M. (1970). & Rothstein, M. G. (1994). Correcting the 16PF for
Handbook for the Sixteen Personality Factor Ques- faking: Effects on criterion-related validity and indi-
tionnaire. Champaign, IL: Institute for Personality vidual hiring decisions. Personnel Psychology, 47,
and Ability Testing. 847–860.
Cautilli, P. G., & Bauman, M. K. (1987). Assessment Chua, S. L., Chen, D., & Wong, A. F. L. (1999). Com-
of the visually impaired client. In B. Bolton (Ed.), puter anxiety and its correlates: A metaanalysis.
Handbook of measurement and evaluation in reha- Computers in Human Behavior, 15, 609–623.
bilitation (2nd ed., pp. 249–262). Baltimore, MD: Cibis, G. W., Maino, J. H., Crandall, M. A., Cress, P.,
Paul H. Brookes. Spellman, C. R., & Shores, R. E. (1985). The Parsons
Cavell, T. A., & Kelley, M. L. (1994). The Checklist of Visual Acuity Test for screening children 18 to 48
Adolescent Problem Situations. Journal of Clinical months old. Annals of Ophthalmology, 17, 471–478.
Child Psychology, 23, 226–238. Clarizio, H. F. (1982). Intellectual assessment of His-
Chambers, D. W. (1983). Stereotypic images of the panic children. Psychology in the Schools, 19, 61–71.
scientist: The Draw-A-Scientist Test. Science Educa- Clark, J. B., & Madison, C. L. (1984). Manual for the
tion, 67, 255–265. Test of Oral Language. Austin, TX: Pro-ed.
Champion, V. (1984). Instrument development for Clark, L., Gresham, F. M., & Elliott, S. N. (1985). Devel-
health belief model constructs. Advances in Nurs- opment and validation of a social skills assessment
ing Science, 6, 73–85. measure. Journal of Psychoeducational Assessment, 3,
Chan, C. K. K., Burtis, P. J., Scardamalia, M., & Bereiter, 347–356.
C. (1992). Constructive activity in learning from Clark, M. J., & Grandy, J. (1984). Sex differences in the
text. American Educational Research Journal, 29, 97– academic performance of Scholastic Aptitude Test
118. takers. College Board Report No. 84–8. New York:
Chara, P. J., Jr., & Verplanck, W. S. (1986). The Imagery College Board Publications.
Questionnaire: An investigation of its validity. Per- Cleary, T. A. (1968). Test bias: Prediction of grades
ceptual and Motor Skills, 63, 915–920. of Negro and white students in integrated colleges.
Chartier, G. M. (1971). A-B therapist variable: Real or Journal of Educational Measurement, 5, 115–124.
imagined? Psychological Bulletin, 75, 22–33. Cleary, T. A., Humphreys, L. G., Kendrick, S. A., &
Chase, C. I. (1985). Review of the Torrance Tests of Wesman, A. (1975). Educational uses of tests with
Creative Thinking. In J. V. Mitchell, Jr. (Ed.), The disadvantaged students. American Psychologist, 30,
ninth mental measurements yearbook (pp. 1486– 15–41.
1487). Lincoln, NB: University of Nebraska. Coan, R. W., Fairchild, M. T., & Dobyns, Z. P. (1973).
Chattin, S. H. (1989). School psychologists’ evaluation Dimensions of experienced control. The Journal of
of the K-ABC, McCarthy scales, Stanford-Binet IV, Social Psychology, 91, 53–60.
and WISC-R. Journal of Psychoeducational Assess- Coates, S., & Bromberg, P. M. (1973). Factorial struc-
ment, 7, 112–130. ture of the Wechsler Preschool and Primary Scale of
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

References 551

Intelligence between the ages of 4 and 6 1/2. Journal Cole, N. S. (1973). Bias in selection. Journal of Educa-
of Consulting and Clinical Psychology, 40, 365–370. tional Measurement, 10, 237–255.
Cofer, C. N., Chance, J., & Judson, A. J. (1949). A Coleman, J. S. (1966). Equality of Educational Opportu-
study of malingering on the Minnesota Multiphasic nity. Washington, DC: U.S. Department of Health,
Personality Inventory. The Journal of Psychology, 27, Education, and Welfare.
491–499. College Entrance Examination Board. (1988). 1988
Coffin, T. E. (1941). Some conditions of suggestion profile of SAT and achievement test takers. New York:
and suggestibility: A study of certain attitudinal and Author.
situational factors influencing the process of sug- Coller, A. R. (1971). Self-concept measures: An anno-
gestion. Psychological Monographs, 53 (Whole No. tated bibliography. Princeton, NJ: Educational Test-
241). ing Service.
Coffman, W. E. (1985). Review of the Structure of Collins, J. M., & Schmidt, F. L. (1993). Personality,
Intellect Learning Abilities Test. In J. V. Mitchell, integrity, and white collar crime: A construct validity
Jr. (Ed.), The ninth mental measurements year- study. Personnel Psychology, 46, 295–311.
book (pp. 1486–1488). Lincoln, NE: University of Compas, B. E., Davis, G. E., Forsythe, C. J., & Wagner,
Nebraska Press. B. M. (1987). Assessment of major and daily stress-
Cohen, B. H., & Saslona, M. (1990). The advantage ful events during adolescence: The Adolescent Per-
of being an habitual visualizer. Journal of Mental ceived Events Scale. Journal of Consulting and Clin-
Imagery, 14, 101–112. ical Psychology, 55, 534–541.
Cohen, J. (1950). Wechsler Memory Scale performance Comrey, A. L. (1958). A factor analysis of items on the
of psychoneurotic, organic, and schizophrenic F scale of the MMPI. Educational and Psychological
groups. Journal of Consulting Psychology, 14, 371– Measurement, 18, 621–632.
375. Comunian, A. L. (1985). The development and vali-
Cohen, J. (1957). The factorial structure of the WAIS dation of the Italian form of the Test Anxiety Inven-
between early adulthood and old age. Journal of Con- tory. In H. M. Van Der Ploeg, R. Schwarzer, & C. D.
sulting Psychology, 21, 283–290. Spielberger (Eds.), Advances in Test Anxiety Research
Cohen, J. (1960). A coefficient of agreement for nom- (Vol. 4, pp. 215–220). Lisse, Netherlands: Swets &
inal scales. Educational and Psychological Measure- Zeitlinger.
ment, 20, 37–46. Cone, J. D. (1977). The relevance of reliability and
Cohen, J., & Cohen, P. (1983). Applied multiple validity for behavioral assessment. Behavior Ther-
regression/correlation analysis for the behavioral sci- apy, 8, 411–426.
ences (2nd ed.). Hillsdale, NJ: Erlbaum. Cone, J. D. (1978). The Behavioral Assessment Grid
Cohen, J., & Lefkowitz, J. (1974). Development of a (BAG): A conceptual framework and a taxonomy.
biographical inventory blank to predict faking on Behavior Therapy, 9, 882–888.
personality tests. Journal of Applied Psychology, 59, Cone, J. D., & Foster, S. L. (1991). Training in
404–405. measurement: Always the bridesmaid. American
Cohen, J. B. (1990). Misuse of computer software to Psychologist, 46, 653–654.
detect faking on the Rorschach: A reply to Kahn, Conners, C. K. (1969). A Teacher Rating Scale for use
Fox, and Rhode. Journal of Personality Assessment, in drug studies with children. American Journal of
54, 58–62. Psychology, 126, 884–888.
Cohn, S. J. (1985). Review of Graduate Record Exam- Conners, C. K. (1970). Symptom patterns in hyperki-
ination. In J. V. Mitchell, Jr. (Ed.), The ninth men- netic, neurotic, and normal children. Child Devel-
tal measurements yearbook (Vol. 1, pp. 622–624). opment, 41, 667–682.
Lincoln: University of Nebraska Press. Conners, C. K. (1990). Conners Rating Scales manual.
Cole, D. A. (1987). Utility of confirmatory factor anal- North Tonawanda, NY: Multi-Health Systems.
ysis in test validation research. Journal of Consulting Conners, C. K., Sitarenios, G., Parker, J. D. A., &
and Clinical Psychology, 55, 584–594. Epstein, J. N. (1998). The revised Conner’s Parent
Cole, K. N., & Fewell, R. R. (1983). A quick language Rating Scale (CPRS-R): Factor structure, reliabil-
screening test for young children: The token test. ity, and criterion validity. Journal of Abnormal Child
Journal of Psycoeducational Assessment, 1, 149–154. Psychology, 26, 257–280.
Cole, N. (1982). The implications of coaching for Conoley, J. C., & Bryant, L. E. (1995). Multi-
ability testing. In A. Wigdor & W. Garner (Eds.), cultural family assessment. In J. C. Conoley &
Ability testing: Uses, consequences, and controversies. E. B. Werth (Eds.), Family assessment (pp. 103–
(pp. 389–414). Washington, DC: National Academy 129). Lincoln, NE: Buros Institute of Mental
Press. Measurements.
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

552 References

Coombs, C. H. (1950). Psychological scaling without Costa, P. T., Jr., & McCrae, R. R. (1985). The NEO
a unit of measurement. Psychological Review, 57, Personality Inventory manual. Odessa, FL: Psycho-
145–158. logical Assessment Resources.
Coons, P. M., & Fine, C. G. (1990). Accuracy of the Costa, P. T., Jr., & McCrae, R. R. (1989). NEO-
MMPI in identifying multiple personality disorder. PI/FFI manual supplement. Odessa, FL: Psychologi-
Psychological Reports, 66, 831–834. cal Assessment Resources.
Cooper, E. (1991). A critique of six measures for Costa, P. T., Jr., & McCrae, R. R. (1992). Normal per-
assessing creativity. Journal of Creative Behavior, 25, sonality assessment in clinical practice: The NEO
194–204. Personality Inventory. Psychological Assessment, 4,
Cooper, K. L., & Gutmann, D. L. (1987). Gender 5–13.
identity and ego mastery style in middle-aged, pre- Costantino, G., Malgady, R., & Rogler, L. H. (1988).
and post-empty nest women. The Gerontologist, 27, Tell-Me-a-Story-TEMAS-Manual. Los Angeles, CA:
347–352. Western Psychological Services.
Coopersmith, S. (1969). A method of determining Costantino, G., Malgady, R. G., Rogler, L. H., &
types of self-esteem. Journal of Abnormal Psychol- Tsui, E. C. (1988). Discriminant analysis of clin-
ogy, 59, 87–94. ical outpatients and public school children by
Coopersmith, S. (1967). The antecedents of self-esteem. TEMAS: A Thematic Apperception Test for Hispan-
San Francisco, CA: W. H. Freeman. ics and Blacks. Journal of Personality Assessment, 52
Corach, N. L., Feldman, M. J., Cohen, I. S., Gruen, W., 670–678.
Meadow, A., & Ringwall, E. A. (1958). Social desir- Couch, A., & Keniston, K. (1960). Yeasayers and
ability as a variable in the Edwards Personal Prefer- naysayers: Agreeing response set as a personality
ence Schedule. Journal of Consulting Psychology, 22, variable. Journal of Abnormal and Social Psychology,
70–72. 60, 151–174.
Corach, N. L., & Powell, B. L. (1963). A factor ana- Court, J. H. (1983). Sex differences in performance on
lytic study of the Frostig Developmental Test of Raven’s Progressive Matrices: A review. The Alberta
Visual Perception. Perceptual and Motor Skills, 16, Journal of Educational Research, 29, 54–74.
39–63. Court, J. H. (1988). A researcher’s bibliography for
Cornell, D. G., & Hawk, G. L. (1989). Clinical pre- Raven’s Progressive Matrices and Mill Hill Vocabulary
sentation of malingerers diagnosed by experienced scales (7th ed.). Cumberland Park, South Australia:
forensic psychologists. Law and Human Behavior, author.
13, 375–383. Court, J. H., & Raven, J. C. (1982). Research and
Cornwell, J. M., Manfredo, P. A., & Dunlap, W. P. references: 1982 update. London: H. K. Lewis.
(1991). Factor analysis of the 1985 revision of Kolb’s Covin, T. M. (1977). Comparison of WISC and WISC-
Learning Style Inventory. Educational and Psycho- R full scale IQs for a sample of children in special
logical Measurement, 51, 455–462. education. Psychological Reports, 41, 237–238.
Cortese, M., & Smyth, P. (1979). A note on the transla- Craig, H. B. (1965). A sociometric investigation of the
tion to Spanish of a measure of acculturation. His- self-concept of the deaf child. American Annals of
panic Journal of Behavioral Sciences, 1, 65–68. the Deaf , 110, 456–474.
Cortina, J. M. (1993). What is coefficient alpha? An Craig, R. J. (1999). Overview and current status of the
examination of theory and applications. Journal of Millon Clinical Multiaxial Inventory. Journal of Per-
Applied Psychology, 78, 98–104. sonality Assessment, 72, 390–406.
Cortina, J. M., Doherty, M. L., Schmitt, N., Kaufman, Craik, K. H. (1971). The assessment of places. In P.
G., & Smith, R. G. (1992). The “Big Five” personality McReynolds (Ed.), Advances in psychological assess-
factors in the IPI and MMPI: Predictors of police ment (Vol. 2, pp. 40–62). Palo Alto, CA: Science &
performance. Personnel Psychology, 45, 119–140. Behavior Books.
Cosden, M. (1987). Rotter Incomplete Sentences Cramer, P. (1999). Future directions for the Thematic
Blank. In D. J. Keyser, & R. C. Sweetland (Eds.), Test Apperception Test. Journal of Personality Assessment,
critiques compendium. Kanas City, MO: Test Corpo- 72, 74–92.
ration of America. Crary, W. G., & Johnson, C. W. (1975). The mental
Costa, P. T., Jr., & McCrae, R. R. (1980). Still stable status examination. In C. W. Johnson, J. R. Snibbe,
after all these years: Personality as a key to some & L. E. Evans (Eds.), Basic psychopathology: A pro-
issues in adulthood and old age. In P. B. Baltes & grammed text (pp. 50–89). New York: Spectrum.
O. G. Brim, Jr. (Eds.), Life span development and Creaser, J. W. (1960). Factor analysis of a study-habits
behavior (Vol. 3, pp. 65–102). New York: Academic Q-sort test. Journal of Counseling Psychology, 7, 298–
Press. 300.
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

References 553

Cress, P. J. (1987). Visual assessment. In M. Bullis Crook, T. H., & Larrabee, G. J. (1990). A self-rating
(Ed.), Communication development in young chil- scale for evaluating memory in everyday life. Psy-
dren with deaf-blindness: Literature Review III. Mon- chology and Aging, 5, 48–57.
mouth, OR: Teaching Research Division of Oregon Crosby, L. A., Bitner, M. J., & Gill, J. D. (1990). Orga-
State System of Higher Education. nizational structure of values. Journal of Business
Crites, J. O. (1978). Career Maturity Inventory. Mon- Research, 20, 123–134.
terey, CA: CTB/McGraw-Hill. Cross, L. H., Impara, J. C., Frary, R. B., & Jaeger, R. M.
Crockett, B. K., Rardin, M. W., & Pasewark, R. A. (1984). A comparison of three methods for estab-
(1975). Relationship between WPPSI and Stanford- lishing minimum standards on the National Teacher
Binet IQs and subsequent WISC IQs in Headstart Examinations. Journal of Educational Measurement,
children. Journal of Consulting and Clinical Psychol- 21, 113–129.
ogy, 43, 922. Cross, O. H. (1947). Braille adaptation of the Min-
Crockett, B. K., Rardin, M. W., & Pasewark, R. A. nesota Multiphasic Personality Inventory for use
(1976). Relationship of WPPSI and subsequent with the blind. Journal of Applied Psychology, 31,
Metropolitan Achievement Test scores in Headstart 189–198.
children. Psychology in the Schools, 13, 19–20. Crouse, J. (1985). Does the SAT help colleges make bet-
Cronbach, L. J. (1942). Studies of acquiescence as a ter selection decisions? Harvard Educational Review,
factor in the true-false test. Journal of Educational 55, 195–219.
Psychology, 33, 401–415. Crowne, D. P., & Marlowe, D. (1960). A new scale of
Cronbach, L. J. (1946). Response sets and test validity. social desirability independent of psychopathology.
Educational and Psychological Measurement, 6, 475– Journal of Consulting Psychology, 24, 349–354.
494. Crowne, D. P., & Marlowe, D. (1964). The approval
Cronbach, L. J. (1949). Statistical methods applied to motive: Studies in evaluative dependence. New York:
Rorschach scores: A review. Psychological Bulletin, Wiley.
46, 393–429. Cuellar, I., Harris, L. C., & Jasso, R. (1980). An accul-
Cronbach, L. J. (1950). Further evidence on response turation scale for Mexican American normal and
sets and test design. Educational and Psychological clinical populations. Hispanic Journal of Behavioral
Measurement, 10, 3–31. Sciences, 2, 199–217.
Cronbach, L. J. (1951). Coefficient alpha and the inter- Cummings, J. A. (1986). Projective drawings. In H. M.
nal structure of tests. Psychometrika, 16, 297–334. Knoff (Ed.), The assessment of child and adolescent
Cronbach, L. J. (1970). Essentials of psychological test- personality (pp. 199–244). New York: Guilford Press.
ing (3rd ed.). New York: Harper & Row. Cummings, J. A. (1989). Review of the Structure of
Cronbach, L. J. (1980). Validity on parole: How can we Intellect Learning Abilities Test. In J. C. Conoley &
go straight? New Directions in Test Measurement, 5, J. J. Kramer (Eds.), The tenth mental measurements
99–108. yearbook (pp. 787–791). Lincoln, NE: University of
Cronbach, L. J. (1988). Five perspectives on the validity Nebraska Press.
argument. In R. Wainer, & H. I. Braun (Eds.), Test Cummings, N. A. (1986). Assessing the computer’s
validity (pp. 3–17). Hillsdale, NJ: Erlbaum. impact: Professional concerns. Computers in Human
Cronbach, L. J., & Gleser, G. C. (1965). Psychological Behavior, 1, 293–300.
tests and personnel decisions. Urbana, IL: University Cummins, J. P., & Das, J. P. (1980). Cognitive pro-
of Illinois Press. cessing, academic achievement, and WISC-R per-
Cronbach, L. J., Gleser, G. C., Rajaratnam, N., & formance in EMR children. Journal of Consulting
Nanda, H. (1972). The dependability of behavioral and Clinical Psychology, 48, 777–779.
measurements. New York: Wiley. Cunningham, M. R., Wong, D. T., & Barbee,
Cronbach, L. J., & Meehl, P. E. (1955). Construct valid- A. P. (1994). Self-presentation dynamics on overt
ity in psychological tests. Psychological Bulletin, 52, integrity tests: Experimental studies of the Reid
281–302. Report. Journal of Applied Psychology, 79, 643–658.
Crook, T. H. (1979). Psychometric assessment in the Cureton, E. E. (1960). The rearrangement test.
elderly. In A. Raskin & L. F. Jarvik (Eds.), Psy- Educational and Psychological Measurement, 20,
chiatric symptoms and cognitive loss in the elderly: 31–35.
Evaluation and assessment techniques (pp. 207–220). Cureton, E. E. (1965). Reliability and validity: Basic
Washington, DC: Hemisphere. assumptions and experimental designs. Educational
Crook, T. H., & Larrabee, G. J. (1988). Interrelation- and Psychological Measurement, 25, 327–346.
ships among everyday memory tests: Stability of fac- Cureton, E. E., Cook, J. A., Fischer, R. T., Laser,
tor structure of age. Neuropsychology, 2, 1–12. S. A., Rockwell, N. J., & Simmons, J. W. (1973).
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

554 References

Length of test and standard error of measurement. Davis, G. A., & Bull, K. S. (1978). Strengthening
Educational and Psychological Measurement, 33, affective components of creativity in a college
63–68. course. Journal of Educational Psychology, 70, 833–
Cyr, J. J., Doxey, N. C. S., & Vigna, C. M. (1988). Fac- 836.
torial composition of the SCL-90R. Journal of Social Davis, W. E. (1975). Race and the differential “power”
Behavior and Personality, 3, 245–252. of the MMPI. Journal of Personality Assessment, 39,
Dahlstrom, W. G. (1993). Tests: Small samples, 141–145.
large consequences. American Psychologist, 48, Davis, W. E., Beck, S. J., & Ryan, T. A. (1973). Race-
393–399. related and educationally-related MMPI profile dif-
Dahlstrom, W. G., Brooks, J. D., & Peterson. C. D. ferences among hospitalized schizophrenics. Journal
(1990). The Beck Depression Inventory: Item order of Clinical Psychology, 29, 478–479.
and the impact of response sets. Journal of Person- Davison, L. A. (1974). Current status of clinical
ality Assessment, 55, 224–233. neuropsychology: In R. M. Reitan & L. A. Davison
Dahlstrom, W. G., Lachar, D., & Dahlstrom, L. E. (Eds.), Clinical neuropsychology: Current status and
(1986). MMPI patterns of American minorities. applications. New York: Hemisphere.
Minneapolis: University of Minnesota Press. Davison, M. L. (1985). Multidimensional scaling ver-
Dahlstrom, W. G., Welsh, G. S., & Dahlstrom, sus components analysis of test intercorrelations.
L. E. (1972). An MMPI handbook: Vols. 1 and 2, Psychological Bulletin, 97, 94–105.
Minneapolis: University of Minnesota Press. Dawes, R. M. (1962). A note on base rates and psycho-
Dana, R. H., Feild, K., & Bolton, B. (1983). Variations metric efficiency. Journal of Consulting Psychology,
of the Bender Gestalt Test: Implications for training 26, 422–424.
and practice. Journal of Personality Assessment, 47, Dean, R. S. (1977). Reliability of the WISC-R with
76–84. Mexican-American children. Journal of School Psy-
D’Andrade, R. G. (1965). Trait psychology and compo- chology, 15, 267–268.
nential analysis. American Anthropologist, 67, 215– Dean, R. S. (1978). Distinguishing learning-disabled
228. and emotionally disturbed children on the WISC-
Dannenbaum, S. E., & Lanyon, R. I. (1993). The use R. Journal of Consulting and Clinical Psychology, 46,
of subtle items in detecting deception. Journal of 381–382.
Personality Assessment, 61, 501–510. Dean, R. S. (1979). Predictive validity of the WISC-R
Darlington, R. B. (1976). A defense of “rational” per- with Mexican-American children. Journal of School
sonnel selection, and two new methods. Journal of Psychology, 17, 55–58.
Educational Measurement, 13, 43–52. Dean, R. S. (1980). Factor structure of the WISC-R
Das, J. P. (1989). Review of the Explorer. In J. C. Cono- with Anglos and Mexican-Americans. Journal of
ley & J. J. Kramer (Eds.), The tenth mental measure- School Psychology, 18, 234–239.
ments yearbook (pp. 888–889). Lincoln, NE: Univer- Dearborn, G. V. (1898). A study of imaginations. Amer-
sity of Nebraska Press. ican Journal of Psychology, 9, 183–190.
Das, J. P., Kirby, J. R., & Jarman, R. F. (1979). Simul- Debra P. v. Turlington, 644 F. 2d 397 (5th Cir. 1981).
taneous and successive cognitive processes. New York: Deffenbacher, J. L. (1980). Worry and emotionality in
Academic Press. test anxiety. In I. G. Sarason (Ed.), Test anxiety: The-
Das, J. P., Naglieri, J. A., & Kirby, J. R. (1994). Assess- ory, research, and applications (pp. 111–128). Hills-
ment of cognitive processes: The PASS. New York: dale, NJ: Erlbaum.
Allyn & Bacon. DeFrancesco, J. J., & Taylor, J. (1993). A valida-
Davey, T., Godwin, J., & Mittelholtz, D. (1997). Devel- tional note on the revised Socialization scale of
oping and scoring an innovative computerized writ- the California Psychological Inventory. Journal of
ing assessment. Journal of Educational Measurement, Psychopathology and Behavioral Assessment, 15, 53–
34, 21–41. 56.
Davidson, K., & MacGregor, M. W. (1996). Reliability DeGiovanni, I. S., & Epstein, N. (1978). Unbinding
of an idiographic Q-sort measure of defense mech- assertion and aggression in research and clinical
anisms. Journal of Personality Assessment, 66, 624– practice. Behavior Modification, 2, 173–192.
639. deJonghe, J. F. M., & Baneke, J. J. (1989). The Zung
Davis, C. (1980). Perkins-Binet Tests of Intelligence for Self-Rating Depression Scale: A replication study
the Blind. Watertown, MA: Perkins School for the on reliability, validity and prediction. Psychological
Blind. Reports, 64, 833–834.
Davis, G. A. (1986). Creativity is forever (2nd ed.). Dekker, R., Drenth, P. J. D., Zaal, J. N., & Koole, F. D.
Dubuque, IA: Kendall-Hunt. (1990). An intelligence test series for blind and low
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

References 555

vision children. Journal of Visual Impairment and to egocentric communication in female preschool-
Blindness, 84, 71–76. ers. Developmental Psychology, 10, 745–747.
Delaney, E. A., & Hopkins, T. F. (1987). Examiner’s DeVito, A. J. (1984). Review of Test Anxiety Inventory.
handbook: An expanded guide for fourth edition users. In D. J. Keyser & R. C. Sweetland (Eds.), Test Cri-
Chicago, IL: Riverside. tiques (Vol. 1, pp. 673–681). Kansas City, MO: Test
Delhees, K. H., & Cattell, R. B. (1971). Manual for Corporation of America.
the Clinical Analysis Questionnaire (CAQ). Cham- DeVito, A. J. (1985). Review of Myers-Briggs Type Indi-
paign, IL: Institute for Personality and Ability cator. In J. V. Mitchell, Jr. (Ed.), The ninth men-
Testing. tal measurements yearbook (Vol. 2, pp. 1030–1032).
DeLongis, A., Coyne, J. C., Dakof, G., Folkman, S., & Lincoln, NB: University of Nebraska Press.
Lazarus, R. S. (1982). Relationship of daily hassles, Dewing, K. (1970). Family influences on creativity: A
uplifts, and major life events to health status. Health review and discussion. The Journal of Special Edu-
Psychology, 1, 119–136. cation, 4, 399–404.
DeLongis, A., Folkman, S., & Lazarus, R. S. (1988). Deyo, R. A., Diehl, A. K., Hazuda, H., & Stern, M. P.
The impact of daily stress on health and mood: (1985). A simple language-based acculturation scale
Psychological and social resources as mediators. for Mexican Americans: Validation and application
Journal of Personality and Social Psychology, 54, to health care research. American Journal of Public
486–495. Health, 75, 51–55.
DeLuty, R. H. (1988–1989). Physical illness, psychi- Diamond, J. J., & Evans, W. J. (1972). An investigation
atric illness, and the acceptability of suicide. Omega, of the cognitive correlates of testwiseness. Journal of
19, 79–91. Educational Measurement, 9, 145–150.
DeMinzi, M. C. R. (1990). A new multidimensional Diamond, J. J., & Evans, W. J. (1973). The correction for
Children’s Locus of Control Scale. The Journal of guessing. Review of Educational Research, 43, 181–
Psychology, 125, 109–118. 191.
Denman, S. (1984). Denman Neuropsychological Mem- Dicken, C. (1963). Good impression, social desirabil-
ory Scale manual. Charleston, SC: Author. ity, and acquiescence as suppressor variables. Educa-
Denney, D. R., & Sullivan, B. J. (1976). De sensitization tional and Psychological Measurement, 23, 699–720.
and modeling treatments of spider fear using two Dickinson, T. L., & Zellinger, P. M. (1980). A compar-
types of scenes. Journal of Consulting and Clinical ison of the behaviorally anchored rating and mixed
Psychology, 44, 573–579. standard scale formats. Journal of Applied Psychol-
Dennis, K. E. (1986). Q methodology: Relevance and ogy, 65, 147–154.
application to nursing research. Advances in Nursing Digman, J. M. (1989). Five robust trait dimensions:
Science, 8, 6–17. Development, stability, and utility. Journal of Per-
Denton, C., & Postwaithe, K. (1985). Able children: sonality, 57, 195–214.
Identifying them in the classroom. Philadelphia, PA: Digman, J. M. (1990). Personality structure: Emer-
Nfer-Nelson. gence of the five-factor model. Annual Review of
DeRenzi, E., & Vignolo, L. (1962). The Token Test: Psychology, 41, 417–440.
A sensitive test to detect receptive disturbances in Digman, J. M., & Inouye, J. (1986). Further specifica-
aphasics. Brain, 85, 665–678. tion of the five robust factors of personality. Journal
Derogatis, L. R. (1987). Self-report measures of stress. of Personality and Social Psychology, 50, 116–123.
In L. Goldberger & S. Breznitz (Eds.), Hand- Digman, J. M., & Takemoto-Chock, N. K. (1981).
book of Stress (pp. 270–294). New York: The Free Factors in the natural language of personality:
Press. Re-analysis, comparison, and interpretation of six
Derogatis, L. R. (1977). The SCL-90 manual: Scoring, major studies. Multivariate Behavioral Research, 16,
administration, and procedures for the SCL-90. Bal- 149–170.
timore, MD: Johns Hopkins University School of Dillon, R. F., Pohlmann, J. T., & Lohman, D. F. (1981).
Medicine, Clinical Psychometrics Unit. A factor analysis of Raven’s Advanced Progressive
Derogatis, L. R. (1983). SCL-90R: Administration, scor- Matrices freed of difficulty factors. Educational and
ing and procedures manual II. Townson, MD: Clini- Psychological Measurement, 41, 1295–1302.
cal Psychometrics Research. Di Vesta, F. (1965). Developmental patterns in the use
Detterman, D. K. (1992) (Ed.), Current topics in human of modifiers as modes of conceptualization. Child
intelligence: Vol. 2. Is mind modular or unitary? Nor- Development, 36, 185–213.
wood, NJ: Ablex. Dobbins, G. H. (1990). How Supervise? In J. Hogan
Deutsch, F. (1974). Observational and sociometric & R. Hogan (Eds.), Business and Industry Testing
measures of peer popularity and their relationship (pp. 472–477). Austin, TX: Pro-ed.
P1: JZP
0521861810rfa1 CB1038/Domino 0 521 86181 0 March 6, 2006 13:33

556 References

Dodd, S. C. (1935). A social distance test in the Domino, G., & Acosta, A. (1987). The relation of accul-
Near East. American Journal of Sociology, 41, 194– turation and values in Mexican Americans. Hispanic
204. Journal of Behavioral Sciences, 9, 131–150.
Dodrill, C. B., & Warner, M. H. (1988). Further studies Domino, G., & Affonso, D. D. (1990). A personality
of the Wonderlic Personnel Test as a brief measure of measure of Erikson’s life stages: The Inventory of
intelligence. Journal of Consulting and Clinical Psy- Psychosocial Balance. Journal of Personality Assess-
chology, 56, 145–147. ment, 54, 576–588.
Dohmen, P., Doll, J., & Feger, H. (1989). A component Domino, G., Affonso, D., & Hannah, M. T. (1991).
theory for attitude objects. In A. Upmeyer (Ed.), Assessing the imagery of cancer: The Cancer
Attitudes and behavioral decisions (pp. 19–59). New Metaphors Test. Journal of Psychosocial Oncology, 9,
York: Springer-Verlag. 103–121.
Dohrenwend, B. S., & Dohrenwend, B. P. (1978). Some Domino, G., & Blumberg, E. (1987). An application of
issues in research on stressful life events. Journal of Gough’s conceptual model to a measure of adoles-
Nervous and Mental Disease, 166, 7–15. cent self-esteem. Journal of Youth and Adolescence,
Doll, E. A. (1953). The measurement of social com- 16, 179–190.
petence. A manual for the Vineland Social Maturity Domino, G., Fragoso, A., & Moreno, H. (1991). Cross-
Scale. Philadelphia: Educational Test Bureau. cultural investigations of the imagery of cancer in
Dollinger, S. J. (1989). Predictive validity of the Grad- Mexican nationals. Hispanic Journal of Behavioral
uate Record Examination in a Clinical Psychology Sciences, 13, 422–435.
Program. Professional Psychology, 20, 56–58. Domino, G., Goldschmid, M., & Kaplan, M. (1964).
Dolliver, R. H. (1969). Strong Vocational Interest Personality traits of institutionalized mongoloid
Blank vs. expressed vocational interests: A review. girls. American Journal of Mental Deficiency, 68, 498–
Psychological Bulletin, 72, 95–107. 502.
Dolliver, R. H., Irwin, J. A., & Bigley, S. E. (1972). Domino, G., & Hannah, M. T. (1987). A comparative
Twelve-year follow-up of the Strong Vocational analysis of social values of Chinese and American
Interest Blank. Journal of Counseling Psychology, 19, children. Journal of Cross-Cultural Psychology, 18,
212–217. 58–77.
Domino, G. (1965). Personality traits in institutional- Domino, G., & Hannah, M. T. (1989). Measuring effec-
ized mongoloids. American Journal of Mental Defi- tive functioning in the elderly: An application of
ciency, 69, 541–547. Erikson’s theory. Journal of Personality Assessment,
Domino, G. (1968). A non-verbal measure of intelli- 53, 319–328.
gence for totally blind adults. The New Outlook for Domino, G., & Lin, J. (1991). Images of cancer: China
the Blind, 62, 247–252. and the United States. Journal of Psychosocial Oncol-
Domino, G. (1970). The identification of potentially ogy, 9, 67–78.
creative persons from the Adjective Check List. Domino, G., & Lin, W. Y. (1993). Cancer metaphors:
Journal of Consulting and Clinical Psychology, 35, Taiwan and the United States. Personality and Indi-
48–51. vidual Differences, 14, 693–700.
Domino, G. (1979). Creativity and the home environ- Domino, G., & Pathanapong, P. (1993). Cancer
ment. Gifted Child Quarterly, 23, 818–828. imagery: Thailand and the United States. Personality
Domino, G. (1980). Chinese Tangrams as a technique and Individual Differences, 14, 693–700.
to assess creativity. Journal of Creative Behavior, 14, Domino, G., & Regmi, M. P. (1993). Attitudes toward
204–213. cancer: A cross-cultural comparison of Nepalese and
Domino, G. (1982). Get high on yourself: The effec- U.S. students. Journal of Cross-Cultural Psychology,
tiveness of a television campaign on self-esteem, 24, 389–398.
drug use, and drug attitudes. Journal of Drug Edu- Donnelly, M., Yindra, K., Long, S. Y., Rosenfeld, P.,
cation, 12, 163–171. Fleisher, D., & Chen, C. (1986). A model for pre-
Domino, G. (1984). Measuring geographical envi- dicting performance on the NBME Part I Exam-
ronments through adjectives: The Environmental ination. Journal of Medical Education, 61, 123–
Check List. Psychological Reports, 55, 151–160. 131.
Domino, G. (1992). Cooperation and competition in Donovan, J. M. (1993). Validation of a Portuguese
Haller, N., & Exner, J. E. Jr. (1985). The reliability of Walsh & S. H. Osipow (Eds.), Advances in Voca-
Rorschach variables for inpatients presenting symp- tional Psychology (Vol. 1, pp. 1–29). Hillsdale, NJ:
toms of depression and/or helplessness. Journal of Lawrence Erlbaum Associates.
Personality Assessment, 49, 516–521. Hansen, J. C., & Campbell, D. P. (1985). Manual for the
Halpin, G., Halpin, G., & Schaer, B. B. (1981). Relative SVIB-SCII (4th ed.). Stanford, CA: Stanford Univer-
effectiveness of the California Achievement Tests sity Press.
in comparison with the ACT Assessment, College Hansen, R., Young, J., & Ulrey, G. (1982). Assess-
Board Scholastic Aptitude Test, and High school ment considerations with the visually handicapped
grade point average in predicting college grade child. In G. Ulrey & S. Rogers (Eds.), Psychological
point average. Educational and Psychological Mea- assessment of handicapped infants and young chil-
surement, 41, 821–827. dren. (pp. 108–114). New York: Thieme-Stratton.
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

570 References

Hanson, S., Buckelew, S. P., Hewett, J., & O’Neal, Hartlage, L. C., & Steele, C. T. (1977). WISC and WISC-
G. (1993). The relationship between coping and R correlates of academic achievement. Psychology in
adjustment after spinal cord injury: A 5-year the Schools, 14, 15–18.
follow-up study. Rehabilitation Psychology, 38, Hartley, J., & Holt, J. (1971). A note on the validity
41–52. of the Wilson-Patterson measure of conservatism.
Hardy, J. B., Welcher, D. W., Mellits, E. D., & Kagan, J. British Journal of Social and Clinical Psychology, 10,
(1976). Pitfalls in the measurement of intelligence: 81–83.
Are standard intelligence tests valid instruments Hartman, A. A. (1970). A basic T.A.T. set. Journal of
for measuring the intellectual potential of urban Projective Techniques and Personality Assessment, 34,
children? Journal of Psychology, 94, 43–51. 391–396.
Harkness, A. R., McNulty, J. L., & Ben-Porath, Y. S. Hartman, D. E. (1986). Artificial intelligence or arti-
(1995). The Personality Psychopathology Five (PSY- ficial psychologist? Conceptual issues in clinical
5): Constructs and MMPI-2 scales. Psychological microcomputer use. Professional Psychology, 17,
Assessment, 7, 104–114. 528–534.
Harmann, H. H. (1960). Modern factor analysis. Hartmann, D. P. (1977). Considerations in the choice
Chicago, IL: University of Chicago Press. of interobserver reliability estimates. Journal of
Harmon, M. G., Morse, D. T., & Morse, L. W. (1996). Applied Behavior Analysis, 10, 103–116.
Confirmatory factor analysis of the Gibb experi- Hartmann, D. P., Roper, B. L., & Bradford, D. C. (1979).
mental test of testwiseness. Educational and Psycho- Source relationships between behavioral and tradi-
logical Measurement, 56, 276–286. tional assessment. Journal of Behavioral Assessment,
Harnett, R. T., & Willingham, W. W. (1980). The cri- 1, 3–21.
terion problem: What measure of success in grad- Hartmann, G. (1938). The differential validity of items
uate education? Applied Psychological Measurement, in a Liberalism-Conservatism Test. Journal of Social
4, 281–291. Psychology, 9, 67–78.
Harrell, T. W. (1992). Some history of the Army Gen- Hartnett, R. T., & Willingham, W. W. (1980). The cri-
eral Classification Test. Journal of Applied Psychology, terion problem: What measure of success in grad-
77, 875–878. uate education? Applied Psychological Measurement,
Harrington, R. G., & Follett, G. M. (1984). The 4, 281–291.
readability of child personality assessment instru- Hartshorne, H., & May, M. A. (1928). Studies in deceit.
ments. Journal of Psychoeducational Assessment, 2, New York: Macmillan.
37–48. Hartshorne, T. S. (1993). Psychometric properties and
Harrington, R. G., & Jennings, V. (1986). Compari- confirmatory factor analysis of the UCLA Loneliness
son of three short forms of the McCarthy Scales of scale. Journal of Personality Assessment, 61, 182–195.
Children’s Abilities. Contemporary Educational Psy- Hase, H. D., & Goldberg, L. R. (1967). Comparative
chology, 11, 109–116. validity of different strategies of constructing per-
Harris, D. B. (1963). Children’s drawings as measures sonality inventory scales. Psychological Bulletin, 67,
of intellectual maturity. New York: Harcourt, Brace 231–248.
& World. Hathaway, S. R., & McKinley, J. C. (1943). The Min-
Harris, M. M., & Schaubroeck, J. (1988). A metaanaly- nesota Multiphasic Personality Inventory. Minneapo-
sis of self-supervisor, self-peer, and peer-supervisor lis, MN: University of Minnesota Press.
ratings. Personnel Psychology, 41, 43–62. Hathaway, S. R., McKinley, J. C., Butcher, J. N.,
Harrower, M. R., & Herrmann, R. (1953). Psychological Dahlstrom, W. G., Graham, J. R., Tellegen, A., &
factors in the care of patients with multiple scleroses: Kaemmer, B. (1989). MMPI-2: Manual for admin-
For use of physicians. New York: National Multiple istration and scoring. Minneapolis, MN: University
Sclerosis Society. of Minnesota Press.
Hart, S. D., Roesch, R., Corrado, R. R., & Cox, Hathaway, S. R., & Meehl, P. E. (1951). An atlas for the
D. N. (1993). The Referral Decision Scale. Law and clinical use of the MMPI. Minneapolis, MN: Univer-
Human Behavior, 17, 611–623. sity of Minnesota Press.
Hartigan, J. A., & Wigdor, A. K. (1989). Fairness Hawkins, N. C., Davies, R., & Holmes, T. H. (1957).
in employment testing. Washington, DC: National Evidence of psychosocial factors in the develop-
Academy Press. ment of pulmonary tuberculosis. American Review
Hartlage, L. C. (1987). Diagnostic assessment in reha- of Tuberculosis and Pulmonary Disorders, 75, 768–
bilitation. In B. Bolton (Ed.), Handbook of mea- 780.
surement and evaluation in rehabilitation (2nd ed., Hayden, D. C., Furlong. M. J., & Linnemeyer, S.
pp. 141–149). Baltimore, MD: Paul H. Brookes. (1988). A comparison of the Kaufman Assessment
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

References 571

Battery for Children and the Stanford-Binet IV for 80–16). Washington, DC: Personnel Research and
the assessment of gifted children. Psychology in the Development Center.
Schools, 25, 239–243. Heckhausen, H. (1967). The anatomy of achievement
Hayes, F. B., & Martin, R. P. (1986). Effectiveness of the motivation. New York: Academic Press.
PPVT-R in the screening of young gifted children. Hedlund, J. L., & Vieweg, B. W. (1979). The Zung Self-
Journal of Psychoeducational Assessment, 4, 27–33. Rating Depression Scale: A comprehensive review.
Hayes, S. C., Nelson, R. O., & Jarrett, R. B. (1986). Journal of Operational Psychiatry, 10, 51–64.
Evaluating the quality of behavioral assessment. In Heil, J., Barclay, A., & Endres, J. M. (1978). A fac-
R. O. Nelson & S. C. Hayes (Eds.), Conceptual foun- tor analytic study of WPPSI scores of educationally
dations of behavioral assessment (pp. 463–503). New deprived and normal children. Psychological Reports,
York: Guilford Press. 42, 727–730.
Hayes, S. N., Floyd, F. J., Lemsky, C., Rogers, E., Wine- Heilbrun, A. B., Jr. (1972). Review of the EPPS. In
miller, D., Heilman, N., Werle, M., Murphy, T., & O. K. Buros (Ed.), The seventh mental measurements
Cardone, L. (1992). The Marital Satisfaction Ques- yearbook (Vol. 1, pp. 148–149). Highland Park, NJ:
tionnaire for older persons. Psychological Assess- Gryphon Press.
ment, 4, 473–482. Heilbrun, A. B., & Goodstein, L. D. (1961). The rela-
Hayes, S. N., Richard, D. C. S., & Kubany, E. S. (1995). tionships between individually defined and group
Content validity in psychological assessment: A defined social desirability and performance on the
functional approach to concepts and methods. Psy- Edwards Personal Preference Schedule. Journal of
chological Assessment, 7, 238–247. Consulting Psychology, 25, 200–204.
Hayes, S. P. (1929). The new revision of the Binet Intel- Heilbrun (1992). The role of psychological testing in
ligence Tests for the blind. Teachers Forum, 2, 2–4. forensic assessment. Law and Human Behavior, 16,
Haynes, J. P., & Bensch, M. (1981). The PV sign on the 257–272.
WISC-R and recidivism in delinquents. Journal of Heim, A. W. (1975). Psychological testing. London:
Consulting and Clinical Psychology, 49, 480–481. Oxford University Press.
Haynes, S. N., & Wilson, C. C. (1979). Behavioral Helmes, E., & Reddon, J. R. (1993). A perspective on
assessment. San Francisco, CA: Jossey-Bass. developments in assessing psychopathology: A crit-
Hays, D. G., & Borgatta, E. P. (1954). An empirical ical review of the MMPI and MMPI-2. Psychological
comparison of restricted and general latent distance Bulletin, 113, 453–471.
analysis. Psychometrika, 19, 271–279. Helms, J. E. (1992). Why is there no study of cultural
Hays, R. D., & DiMatteo, M. R. (1987). A short-form equivalence in standardized cognitive ability testing?
measure of loneliness. Journal of Personality Assess- American Psychologist, 47, 1083–1101.
ment, 51, 69–81. Hendershott, J. L., Searight, H. R., Hatfield, J. L.,
Hayslip, B., Jr. (1984). Idiographic assessment of the & Rogers, B. J. (1990). Correlations between the
self in the aged: A case for the use of the Q-sort. Inter- Stanford-Binet, fourth edition and the Kaufman
national Journal of Aging and Human Development, Assessment Battery for Children for a preschool
20, 293–311. sample. Perceptual and Motor Skills, 71, 819–
Heath, R. L., & Fogel, D. S. (1978). Terminal and 825.
instrumental? An inquiry into Rokeach’s Value Sur- Henderson, R. W., & Rankin, R. J. (1973). WPPSI reli-
vey. Psychological Reports, 42, 1147–1154. ability and predictive validity with disadvantaged
Heaton, R. K., Grant, I., Anthony, W. Z., & Lehman, Mexican-American children. Journal of School Psy-
R. A. W. (1981). A comparison of clinical and auto- chology, 11, 16–20.
mated interpretation of the Halstead-Reitan battery. Henerson, M. E., Morris, L. L., & Fitz-Gibbon, C. T.
Journal of Clinical Neuropsychology, 3, 121–141. (1987). How to measure attitudes. Newbury Park,
Heaton, R. K., Smith, H. H., Jr., Lehman, R. A. W., CA: Sage.
& Vogt, A. J. (1978). Prospects for faking believ- Henry, P., Bryson, S., & Henry, C. A. (1990). Black
able deficits on neuropsychological testing. Journal student attitudes toward standardized tests: Does
of Consulting and Clinical Psychology, 46, 892–900. gender make a difference? College Student Journal,
Heaton, R. R., Grant, I., & Matthews, C. G. (1991). 23, 346–354.
Comprehensive norms for an expanded Halstead- Henry, W. E. (1956). The analysis of fantasy: The the-
Reitan Battery. Odessa, FL: Psychological Assess- matic apperception technique in the study of person-
ment Resources. ality. New York: Wiley.
Heaton, S. M., Nelson, A. V., & Nester, M. A. (1980). Herrmann, D. J. (1982). Know thy memory: The use
Guide for administering examinations to handi- of questionnaires to assess and study memory. Psy-
capped individuals for employment purposes (PRR chological Bulletin, 92, 434–452.
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

572 References

Hersen, M. (1973). Self-assessment of fear. Behavior Hilliard, A. G. (1975). The strengths and weaknesses of
Therapy, 4, 241–257. cognitive tests for young children. In J. D. Andrews
Hersen, M., & Bellack, A. S. (1977). Assessment of (Ed.), One child indivisible. (pp. 17–33). Washing-
social skills. In A. R. Ciminero, K. S. Calhoun, & ton, DC: National Association for the Education of
H. E. Adams (Eds.), Handbook of behavioral assess- Young Children.
ment (pp. 509–554). New York: Wiley. Hills, J. R., Bush, M., & Klock, J. A. (1964). Predict-
Hersen, M., & Bellack, A. S. (Eds.). (1981). Behavioral ing grades beyond the freshman year. College Board
assessment. Elmsford, NY: Pergamon. Review, 54, 22–24.
Hertzog, C., & Schear, J. M. (1989). Psychometric con- Hilton, T. L., & Korn, J. H. (1964). Measured change in
siderations in testing the older person. In T. Hunt & personal values. Educational and Psychological Mea-
C. J. Lindley (Eds.), Testing older adults (pp. 24–50). surement, 24, 609–622.
Austin, TX: Pro-ed. Himmelfarb, S. (1984). Age and sex differences in the
Herzog, A. R., & Rodgers, W. L. (1981). Age mental health of older persons. Journal of Consulting
and satisfaction: Data from several large surveys. and Clinical Psychology, 52, 844–856.
Research on Aging, 3, 142–165. Hinckley, E. D. (1932). The influence of individual
Hess, A. K. (1985). Review of Millon Clinical Multiaxial opinion on construction of an attitude scale. Journal
Inventory. In J. V. Mitchell, Jr. (Ed.), The ninth men- of Social Psychology, 3, 283–296.
tal measurements yearbook (Vol. 1, pp. 984–986). Hirshoren, A., Hurley, O. L., & Hunt, J. T. (1977). The
Lincoln, NE: University of Nebraska Press. WISC-R and the Hiskey-Nebraska Test with deaf
Hess, D. W. (1969). Evaluation of the young deaf adult. children. American Annals of the Deaf, 122, 392–394.
Journal of Rehabilitation of the Deaf, 3, 6–21. Hirshoren, A., Hurley, O. L., & Kavale, K. (1979).
Hess, E. H. (1965). Attitude and pupil size. Scientific Psychometric characteristics of the WISC-R Perfor-
American, 212, 46–54. mance Scale with deaf children. Journal of Speech
Hess, E. H., & Polt, J. M. (1960). Pupil size as related and Hearing Disorders, 44, 73–79.
to interest value of visual stimuli. Science, 132, 349– Hiscock. M. (1978). Imagery assessment through self-
350. report: What do imagery questionnaires measure?
Hetzler, S. A. (1954). Radicalism-Conservatism and Journal of Consulting and Clinical Psychology, 46,
social mobility. Social Forces, 33, 161–166. 223–230.
Heveren, V. W. (1980). Recent validity studies of the Hiskey, M. S. (1966). Hiskey-Nebraska Test of Learning
Halstead-Reitan approach to clinical neuropsycho- Aptitude. Lincoln, NE: Union College Press.
logical assessment. Clinical Neuropsychology, 2, 49– Hobbs, S. A., & Walle, D. L. (1985). Validation of the
61. Children’s Assertive Behavior Scale. Journal of Psy-
Hevner, K. (1930). An empirical study of three psy- chopathology and Behavioral Assessment, 7, 145–153.
chophysical methods. Journal of General Psychology, Hocevar, D. (1979). A comparison of statistical infre-
4, 191–212. quency and subjective judgment as criteria in the
Hilgard, E. R. (1957). Lewis Madison Terman measurement of originality. Journal of Personality
1877–1956. American Journal of Psychology, 70, Assessment, 43, 297–299.
472–479. Hodges, W. F., & Spielberger, C. D. (1966). The effects
Hill, A. B., Kemp-Wheeler, S. M., & Jones, S. A. (1986). of threat of shock on heart rate for subjects who
What does Beck Depression Inventory measure in differ in manifest anxiety and fear of shock. Psy-
students? Personality and Individual Differences, 7, chophysiology, 2, 287–294.
39–47. Hoemann, H. W. (1972). Communication accuracy
Hill, K. T. (1972). Anxiety in the evaluative context. in a sign-language interpretation of a group test.
In W. Hartup (Ed.), The young child (Vol. 2). Journal of Rehabilitation of the Deaf, 5, 40–43.
Washington, DC: National Association for the Edu- Hoffman, R. A., & Gellen, M. I. (1983). The Tennessee
cation of Young Children. Self Concept Scale: A revisit. Psychological Reports,
Hill, K. T., & Sarason, S. B. (1966). The relation of test 53, 1199–1204.
anxiety and defensiveness to test and school per- Hogan, R. (1969). Development of an empathy scale.
formance over the elementary school years: A fur- Journal of Consulting and Clinical Psychology, 33,
ther longitudinal study. Monographs of the Society 307–316.
for Research in Child Development, 31 (No. 2). Hogan R. (1989a). Review of the Personality Research
Hiller, J. B., Rosenthall, R., Bornstein, R., Berry, D., Form (3rd ed.). In J. C. Conoley & J. J. Kramer
& Brunell-Neuleib, S. (1999). A comparative meta- (Eds.), The tenth mental measurements yearbook
analysis of Rorschach and MMPI validity. Psycho- (pp. 632–633). Lincoln, NB: University of Nebraska
logical Assessment, 11, 278–296. Press.
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

References 573

Hogan, R. (1989b). Review of the NEO Personal- Hollandsworth, J. G., Galassi, J. P., & Gay, M. L. (1977).
ity Inventory. In J. C. Conoley & J. J. Kramer The Adult Self Expression Scale: Validation by the
(Eds.), The tenth mental measurements yearbook multitrait-multimethod procedure. Journal of Clin-
(pp. 546–547). Lincoln, NE: University of Nebraska ical Psychology, 33, 407–415.
Press. Hollenbeck, G. P., & Kaufman, A. S. (1973). Factor
Hogan, R. (1990). What kinds of tests are useful in analysis of the Wechsler Preschool and Primary Scale
organizations? In J. Hogan & R. Hogan (Eds.), Busi- of Intelligence (WPPSI). Journal of Clinical Psychol-
ness and industry testing (pp. 22–35). Austin, TX: ogy, 29, 41–45.
Pro-Ed. Hollinger, R. D., & Clark, J. P. (1983). Theft by employ-
Hogan, R., DeSoto, C. B., & Solano, C. (1977). Traits, ees. Lexington, MA: Lexington Books.
tests, and personality research. American Psycholo- Hollrah, J. L., Schlottmann, R. S., Scott, A. B., &
gist, 32, 255–264. Brunetti, D. G. (1995). Validity of the MMPI subtle
Hogan, R. T. (1983). A socioanalytic theory of person- items. Journal of Personality Assessment, 65, 278–
ality. In M. Page (Ed.), 1982 Nebraska Symposium 299.
on Motivation (pp. 55–89). Lincoln, NB: University Holm, C. S. (1987). Testing for values with the deaf: The
of Nebraska Press. language/cultural effect. Journal of Rehabilitation of
Hoge, D. R., & Bender, I. E. (1974). Factors influencing the Deaf, 20, 7–19.
value change among college graduates in adult life. Holmes, T. H., & Rahe, R. H. (1967). The Social
Journal of Personality and Social Psychology, 29, 572– Readjustment Rating Scale. Journal of Psychosomatic
585. Research, 11, 213–218.
Hoge, S. K., Bonnie, R. J., Poythress, N., & Monahan, Holt, R. R. (1951). The Thematic Apperception Test. In
J. (1999). The MacArthur Competence Assessment H. H. Anderson & G. L. Anderson (Eds.), An intro-
Tool – Criminal Adjudication. Odessa, FL: Psycho- duction to projective techniques. Englewood Cliffs,
logical Assessment Resources. NJ: Prentice Hall.
Hojat, M. (1982). Psychometric characteristics of the Holt, R. R. (1958). Clinical and statistical prediction: A
UCLA Loneliness Scale: A study with Iranian col- reformulation and some new data. Journal of Abnor-
lege students. Educational and Psychological Mea- mal and Social Psychology, 56, 1–12.
surement, 42, 917–925. Holt, R. R. (1971). Assessing personality. New York:
Holaday, M., Smith, D. A., & Sherry, A. (2000). Sen- Harcourt Brace Jovanovich.
tence completion tests: A review of the literature Holtzman, W. H. (1968). Cross-cultural studies in psy-
and results of a survey of members of the Society chology. International Journal of Psychology, 3, 83–
for Personality Assessment. Journal of Personality 91.
Assessment, 74, 371–383. Holtzman, W. H., Thorpe, J. S., Swartz, J. D., & Herron,
Holahan, C. K., & Holahan, C. J. (1987). Life stress, E. W. (1961). Inkblot perception and personality –
hassles, and self-efficacy in aging: A replication and Holtzman Inkblot Technique. Austin, TX: University
extension. Journal of Applied Social Psychology, 17, of Texas Press.
574–592. Honaker, L. M., Hector, V. S., & Harrell, T. H.
Holden, R. R., & Fekken, G. C. (1989). Three common (1986). Perceived validity of computer-vs. clinician-
social desirability scales: Friends, acquaintances, or generated MMPI reports. Computers in Human
strangers. Journal of Research in Personality, 23, 180– Behavior, 2, 77–83.
191. Hong, K. E., & Holmes, T. H. (1973). Transient diabetes
Holland, J. L. (1966). The psychology of vocational mellitus associated with culture change. Archives of
choice. Waltham, MA: Blaisdell. General Psychiatry, 29, 683–687.
Holland, J. L. (1973). Making vocational choices: A the- Hopkins, K. D., & McGuire, L. (1966). Mental mea-
ory of careers. Englewood Cliffs, NJ: Prentice-Hall. surement of the blind: The validity of the Wechsler
Holland, J. L. (1985a). Making vocational choices: A Intelligence Scale for Children. International Journal
theory of vocational personalities and work environ- for the Education of the Blind, 15, 65–73.
ments (2nd ed.). Englewood Cliffs, NJ: Prentice Hops, H., & Lewin, L. (1984). Peer sociometric forms.
Hall. In T. H. Ollendick & M. Hersen (Eds.), Child
Holland, J. L. (1985b). Professional manual for the self- Behavioral Assessment (pp. 124–147). New York:
directed search. Odessa, FL: Psychological Assess- Pergamon.
ment Resources. Horiguchi, J., & Inami, Y. (1991). A survey of the liv-
Holland, J. L., & Nichols, R. C. (1964). Prediction of ing conditions and psychological states of elderly
academic and extracurricular achievement in col- people admitted to nursing homes in Japan. Acta
lege. Journal of Educational Psychology, 55, 55–65. Psychiatrica Scandinavica, 83, 338–341.
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

574 References

Horn, J. (1986). Intellectual ability concepts. In R. J. Measurement and conceptual issues. Child Develop-
Sternberg (Ed.), Advances in the psychology of human ment, 53, 571–600.
intelligence (Vol. 3). Hillsdale, NJ: Erlbaum. Hudgens, R. W. (1974). Personal catastrophe and
Horn, J. L., Wanberg, K. W., & Foster, F. M. (1987). depression: A consideration of the subject with
Guide to the Alcohol Use Inventory. Minneapolis, respect to medically ill adolescents, and a requiem
MN: National Computer Systems. for retrospective life-event studies. In B. S. Dohren-
Hornick, C. W., James, L. R., & Jones, A. P. (1977). wend, & B. P. Dohrenwend (Eds.), Stressful life
Empirical item keying vs. a rational approach to events: Their nature and effects (pp. 119–134). New
analyzing a psychological climate questionnaire. York: Wiley.
Applied Psychological Measurement, 1, 489–500. Hui, C. H., & Triandis, H. C. (1985). Measurement in
Horowitz, M. D., Wilner, N., & Alvarez, W. (1979). cross-cultural psychology. Journal of Cross-Cultural
Impact of event scale: A measure of subjective stress. Psychology, 16, 131–152.
Psychosomatic Medicine, 41, 209–218. Hui, C. H., & Triandis, H. C. (1989). Effects of cul-
Hough, L. M. (1992). The “Big Five” personality ture and response format on extreme response style.
variables – construct confusion: Description vs. Journal of Cross-Cultural Psychology, 20, 296–309.
prediction. Human Performance, 5, 139–155. Huitema, B. E., & Stein, C. R. (1993). Validity of
Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, the GRE without restriction of range. Psychological
J. D., & McCloy, R. A. (1990). Criterion-related Reports, 72, 123–127.
validities of personality constructs and the effect of Hull, J. G., Van Treuren, R. R., & Virnelli, S. (1987).
response distortion on those validities. Journal of Hardiness and health: A critique and alternative
Applied Psychology, 75, 581–585. approach. Journal of Personality and Social Psychol-
House, A. E., House, B. J., & Campbell, M. B. (1981). ogy, 53, 518–530.
Measures of interobserver agreement: Calculation Humphreys, L. G. (1962). The organization of human
formulas and distribution effects. Journal of Behav- abilities. American Psychologist, 17, 475–483.
ioral Assessment, 3, 37–57. Humphreys, L. G. (1985). Review of the System
House, J. D., & Johnson, J. J. (1993a). Graduate Record of Multicultural Pluralistic Assessment. In J. V.
Examination scores and academic background vari- Mitchell (Ed.), The ninth mental measurements year-
ables as predictors of graduate degree comple- book (pp. 1517–1519). Lincoln, NB: University of
tion. Educational and Psychological Measurement, Nebraska Press.
35, 551–556. Hunsley, J., Vito, D., Pinsent, C., James, S., & Lefebvre,
House, J. D., & Johnson, J. J. (1993b). Predictive valid- M. (1996). Are self-report measures of dyadic rela-
ity of the Graduate Record Examination Advanced tionships influenced by impression management
Psychology Test for graduate grades. Psychological biases? Journal of Family Psychology, 10, 322–330.
Reports, 73, 184–186. Hunter, J. E. (1986). Cognitive ability, cognitive apti-
House, J. D., Johnson, J. J., & Tolone, W. L. (1987). tudes, job knowledge, and job performance. Journal
Predictive validity of the Graduate Record Exam- of Vocational Behavior, 29, 340–362.
ination for performance in selected graduate psy- Hunter, J. E., & Hunter, R. F. (1984). Validity and utility
chology courses. Psychological Reports, 60, 107–110. of alternative predictors of job performance. Psycho-
Houston, L. N. (1980). Predicting academic achieve- logical Bulletin, 96, 72–99.
ment among specially admitted black female col- Hunter, J. E., Schmidt, F. L., & Hunter, R. F. (1979).
lege students. Educational and Psychological Mea- Differential validity of employment tests by race:
surement, 40, 1189–1195. A comprehensive review and analysis. Psychological
Houtz, J. C., & Shaning, D. J. (1982). Contribution Bulletin, 86, 721–735.
of teacher ratings of behavioral characteristics to Hunter, J. E., Schmidt, F. L., & Rauschenberger, J. M.
the prediction of divergent thinking and problem (1977). Fairness of psychological tests: Implications
solving. Psychology in the Schools, 19, 380–383. of four definitions for selection utility and minority
Hoyt, D. R., & Creech, J. C. (1983). The Life Satisfaction hiring. Journal of Applied Psychology, 62, 245–260.
Index: A methodological and theoretical critique. Huntley, C. W. (1965). Changes in Study of Values
Journal of Gerontology, 38, 111–116. scores during the four years of college. Genetic Psy-
Hu, S., & Oakland, T. (1991). Global and regional per- chology Monographs, 71, 349–383.
spectives on testing children and youth: An empir- Hurtz, G. M., & Hertz, N. M. R. (1999). How many
ical study. International Journal of Psychology, 26, raters should be used for establishing cutoff scores
329–344. with the Angoff method? A generalizability theory
Hubert, N. C., Wachs, T. D., Peters-Martin, P., & Gan- study. Educational and Psychological Measurement,
dour, M. J. (1982). The study of early temperament: 59, 885–897.
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

References 575

Hutt, M. (1977). The Hutt adaptation of the Bender- Jackson, D. N. (1984). Personality Research Form man-
Gestalt (3rd ed.). New York: Grune & Stratton. ual (3rd ed.). Port Huron, MI: Research Psycholo-
Hyde, J. S., Fennema, E., & Lamon, S. J. (1990). Gender gists Press.
differences in mathematics performance: A meta- Jackson, D. N., & Messick, S. (1958). Content and style
analysis. Psychological Bulletin, 107, 139–155. in personality assessment. Psychological Bulletin, 55,
Hynd, G. W. (1988). Neuropsychological assessment 243–252.
in clinical child psychology. Newbury Park, CA: Jackson, D. N., & Minton, H. (1963). A forced-choice
Sage. adjective preference scale for personality assess-
Illerbrun, D., Haines, L., & Greenough, P. (1985). Lan- ment. Psychological Reports, 12, 515–520.
guage identification screening test for kindergarten: Jackson, J. L. (1999). Psychometric considerations
A comparison with four screening and three diag- in self-monitoring assessment. Psychological Assess-
nostic language tests. Language, Speech and Hearing ment, 11, 439–447.
Services in Schools, 16, 280–292. Jacob, S., & Brantley, J. C. (1987). Ethical-legal
Inglehart, R. (1985). Aggregate stability and problems with computer use and suggestions for
individual-level flux in mass belief systems: best practices: A national survey. School Psychology
The level of analysis paradox. American Political Review, 16, 69–77.
Science Review, 79, 97–116. Jacobs, J. W., Bernhard, M. R., Delgado, A., & Strain.
Insko, C. A., & Schopler, J. (1967). Triadic consistency: J. J. (1977). Screening for organic mental syndromes
A statement of affective-cognitive-conative consis- in the medically ill. Annals of Internal Medicine, 86,
tency. Psychological Review, 74, 361–376. 40–46.
Inwald, R., Knatz, H., & Shusman, E. (1983). Inwald Jacobson, L. I., Kellogg, R. W., Cauce, A. M., & Slavin,
Personality Inventory manual. New York: Hilson R. S. (1977). A multidimensional social desirabil-
Research. ity inventory. Bulletin of the Psychonomic Society, 9,
Ireton, H., & Thwing, E. (1979). Minnesota Preschool 109–110.
Inventory. Minneapolis, MN: Behavior Science Jacobson, L. I., Prio, M. A., Ramirez, M. A., Fernandez,
Systems. A. J., & Hevia, M. L. (1978). Construccion y valida-
Ironson, G. H., & Davis, G. A. (1979). Faking high or cion de una prueba de intelligencia para Cubanos.
low creativity scores on the Adjective Check List. Interamerican Journal of Psychology, 12, 39–45.
Journal of Creative Behavior, 13, 139–145. Jaeger, R. M. (1985). Review of Graduate Record Exam-
Irvine, S. H. (1969). Figural tests of reasoning in Africa. inations. In J. V. Mitchell, Jr. (Ed.), The ninth men-
International Journal of Psychology, 4, 217–228. tal measurements yearbook (Vol. 1, pp. 624–626).
Irwin, R. B. (1914). A Binet scale for the blind. New Lincoln: University of Nebraska Press.
Outlook for the Blind, 8, 95–97. Jaeger, R. M., Linn, R. L., & Tesh, A. S. (1989). A syn-
Iwawaki, S., & Cowen, E. L. (1964). The social desir- thesis of research on some psychometric properties
ability of trait descriptive terms: Applications to a of the GATB. In J. A. Hartigan & A. K. Wigdor
Japanese sample. Journal of Social Psychology, 63, (Eds.), Fairness in employment testing (pp. 303–324).
199–205. Washington, DC: National Academy Press.
Iwao, S., & Triandis, H. C. (1993). Validity of auto- Janda, L., & Galbraith, G. (1973). Social desirability
and heterostereotypes among Japanese and Amer- and adjustment in the Rotter Incomplete Sentences
ican students. Journal of Cross-Cultural Psychology, Blank. Journal of Consulting and Clinical Psychology,
24, 428–444. 40, 337.
Jaccard, J. (1981). Attributes and behavior: Implica- Janis, I. (1980). Personality differences in decision
tions of attitudes towards behavioral alternatives. making under stress. In K. Blandenstein, P. Pliner,
Journal of Experimental Social Psychology, 17, 286– & J. Polivy (Eds.), Assessment and modification
307. of emotional behavior (pp. 165–189). New York:
Jackson, D. N. (1967). Personality research form man- Plenum.
ual. Goshen, NY: Research Psychologists Press. Janz, N. K., & Becker, M. H. (1984). The health belief
Jackson, D. N. (1970). A sequential system for per- model: A decade later. Health Education Quarterly,
sonality scale development. In C. D. Spielberger 11, 1–47.
(Ed.), Current topics in clinical and community psy- Jastak, J. F., & Jastak, J. R. (1964). Short forms of
chology (Vol. 2, pp. 61–96). New York: Academic the WAIS and WISC Vocabulary subtests. Journal
Press. of Clinical Psychology, 20, 167–199.
Jackson, D. N. (1977). Manual for the Jackson Voca- Jastak, J. F., & Jastak, S. R. (1972). Manual for the Wide
tional Interest Survey. Port Huron, MI: Research Psy- Range Interest Opinion Test. Wilmington, DE: Guid-
chologists Press. ance Associates of Delaware.
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

576 References

Jastak, J. F., & Jastak, S. (1979). Wide Range Interest- Joe, V. C., & Kostyla, S. (1975). Social attitudes and
Opinion Test. Wilmington, DE: Jastak Associates. sexual behaviors of college students. Journal of Con-
Jastrow, J. (1930). Autobiography. In C. Murchi- sulting and Clinical Psychology, 43, 430.
son (Ed.), A history of psychology in autobiography Johansson, C. B. (1975). Manual for the Career Assess-
(Vol. 1, pp. 135–162). Worcester, MA: Clark Univer- ment Inventory. Minneapolis, MN: National Com-
sity Press. puter Systems.
Jencks, C. (1972). Inequality: A reassessment of the effect Johansson, C. B. (1986). Manual for the Career
of family and schooling in America. New York: Harper Assessment Inventory-Enhanced Version. Minneapo-
& Row. lis, MN: National Computer Systems.
Jenkins, C. D., Rosenman, R. H., & Friedman, M. John, E., Cavanaugh, C., Krauss-Whitbourne, E. S.
(1967). Development of an objective psycholog- (Eds.). (1999). Gerontology: An interdisciplinary per-
ical test for the determination of the coronary- spective. New York: Wiley.
prone behavior pattern in employed men. Journal Johnson, D. G., Lloyd, S. M. Jr., Jones, R. F., & Ander-
of Chronic Diseases. 20, 371–379. son, J. (1986). Predicting academic performance at
Jenkins, C. D., Rosenman, R. H., & Zyzanski, a predominantly Black medical school. Journal of
S. J. (1974). Prediction of clinical coronary-prone Medical Education, 61, 629–639.
behavior pattern. New England Journal of Medicine, Johnson, D. F., & White, C. B. (1980). Effects of training
290, 1271–1275. on computerized test performance in the elderly.
Jenkins, C. D., & Zyzanski, S. J. (1980). Behavioral risk Journal of Applied Psychology, 65, 357–358.
factors and coronary heart disease. Psychotherapy Johnson, D. L., & McGowan, R. J. (1984). Com-
and Psychosomatics, 34, 149–177. parison of three intelligence tests as predictors of
Jensema. C. (1975a). A statistical investigation of the academic achievement and classroom behaviors of
16PF, Form E as applied to hearing-impaired college Mexican-American children. Journal of Psychoedu-
students. Journal of Rehabilitation of the Deaf, 9, 21– cational Assessment, 2, 345–352.
29. Johnson, E. G. (1992). The design of the National
Jensema, C. (1975b). Reliability of the 16 PF Form Assessment of Educational Progress. Journal of Edu-
E for hearing-impaired college students. Journal of cational Measurement, 29, 95–110.
Rehabilitation of the Deaf, 8, 14–18. Johnson, J. H., & Overall, J. E. (1973). Factor analysis
Jensen, A. R. (1969). How much can we boost IQ of the Psychological Screening Inventory. Journal of
and scholastic achievement? Harvard Educational Consulting and Clinical Psychology, 41, 57–60.
Review, 39, 1–23. Johnson, J. H., & Williams, T. A. (1980). Using on-
Jensen, A. R. (1974). How biased are culture-loaded line computer technology in a mental health admit-
tests? Genetic Psychology Monographs, 90, 185–244. ting system. In J. B. Sidowski, J. H. Johnson, &
Jensen, A. R. (1976). Test bias and construct validity. T. A. Williams (Eds.), Technology in mental health
Phi Delta Kappan, 58, 340–346. care delivery systems (pp. 237–249). Norwood, NJ:
Jensen, A. R. (1980). Bias in mental testing. New York: Ablex.
Free Press. Johnson, J. R., Null, C., Butcher, J. N., & Johnson, K. N.
Jensen, A. R. (1984). The black-white difference on the (1984). Replicated item level factor analysis of the
K-ABC: Implications for future tests. The Journal of full MMPI. Journal of Personality and Social Psychol-
Special Education, 18, 377–408. ogy, 47, 105–114.
Jensen, A. R. (1987). The g beyond factor analysis. In Johnson, R. W. (1972). Contradictory scores on the
R. R. Ronning, J. C. Conoley, J. A. Glover, & J. C. Strong Vocational Interest Blank. Journal of Coun-
Witt (Eds.), The influence of cognitive psychology on seling Psychology, 19, 487–490.
testing (pp. 87–142). Hillsdale, NJ: Erlbaum. Johnson, W. L., & Johnson, A. M. (1993). Validity of the
Jensen, A. R. (1998). The g factor. Westport, CT: quality of school life scale: A primary and second-
Praeger. order factor analysis. Educational and Psychological
Jensen, J. P., & Bergin, A. E. (1988). Mental health val- Measurement, 53, 145–153.
ues of professional therapists: A national interdisci- Johnson, W. R. (1981). Basic interviewing skills. In
plinary survey. Professional Psychology Research and C. E. Walker (Ed.), Clinical practice of psychology
Practice, 19, 290–297. (pp. 83–128). New York: Pergamon Press.
Joe, V. C. (1971). Review of the internal-external con- Joncich, G. (1966). Complex forces and neglected
trol construct as a personality variable. Psychological acknowledgments in the making of a young
Reports, 28, 619–640. psychologist: Edward L. Thorndike and his teach-
Joe, V. C. (1974). Personality correlates of conser- ers. Journal of the History of the Behavioral Sciences,
vatism. Journal of Social Psychology, 93, 309–310. 2, 43–50.
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

References 577

Jones, D. H., & Ragosta, M. (1982). Predictive validity of tion of mental status in the aged. American Journal
the SAT on two handicapped groups: The deaf and the of Psychiatry, 117, 326–328.
learning disabled (82–9). Princeton, NJ: Educational Kalat, J. W., & Matlin, M. W. (2000). The GRE Psy-
Testing Service. chology test: A useful but poorly understood test.
Jones, E. E., & Sigall. H. (1971). The bogus pipeline: Teaching of Psychology, 27, 24–27.
A new paradigm for measuring affect and attitude. Kamphaus, R. W. (1993). Clinical assessment of
Psychological Bulletin. 76, 349–364. children’s intelligence. Boston, MA: Allyn &
Jones, J. W. (Ed.). (1991). Preemployment honesty test- Bacon.
ing. New York: Quorum Books. Kamphaus, R. W., & Lozano, R. (1984). Develop-
Jones, R. F. (1986). The effect of commercial coaching ing local norms for individually administered tests.
courses on performance on the MCAT. Journal of School Psychology Review, 13, 491–498.
Medical Education. 61, 273–284. Kamphaus, R. W., & Pleiss, K. L. (1991). Draw-a-
Jones, R. F., & Adams, L. N. (1982). An annotated bib- Person techniques: Tests in search of a construct.
liography of research on the Medical College Admis- Journal of School Psychology, 29, 395–401.
sion Test. Washington, DC: Association of American Kamphaus, R. W., & Reynolds, C. R. (1987). Clinical
Medical Colleges. and research applications of the K-ABC. Circle Pines,
Jones, R. F., & Thomae-Forgues, M. (1984). Validity MN: American Guidance Service,
of the MCAT in predicting performance in the first Kangas, J., & Bradway, K. (1971). Intelligence at
two years of medical school. Journal of Medical Edu- middle-age: A thirty-eight year follow up. Devel-
cation, 59, 455–464. opmental Psychology, 5, 333–337.
Joynson, R. B. (1989). The Burt affair. London: Kanner, A. D., Coyne, J. C., Schaefer, C., & Lazarus,
Routledge. R. S. (1981). Comparison of two modes of stress
Julian, J. Sotile, R. Henry, M. & Sotile, P. (1991) measurement: Daily hassles and uplifts vs. major
The family Apperception Test: Manual. New York: life events. Journal of Behavioral Medicine, 4,
authors. 1–39.
Jung, C. G. (1910). The association method. American Kanner, A. D., Feldman, S. S., Weinberger, D. A., &
Journal of Psychology, 21, 219–235. Ford, M. E. (1987). Uplifts, hassles, and adaptational
Jurg, C. G. (1923). Psychological types. New York: outcomes in early adolescents. Journal of Early Ado-
Harcount, Brace, & Co. lescence, 7, 371–394.
Justen, J. E., & Brown, G. (1977). Definitions of severely Kantor, J. E., Walker, C. E., & Hays, L. (1976). A study
handicapped: A survey of state departments of edu- of the usefulness of Lanyon’s Psychological Screen-
cation. AAESPH Review, 2, 8–14. ing Inventory with adolescents. Journal of Consulting
Kabacoff, R. I., Miller, I. W., Bishop, D. S., Epstein, and Clinical Psychology, 44, 313–316.
N. B., & Keitner, G. I. (1990). A psychome- Kaplan, H. E., & Alatishe, M. (1976). Comparison of
tric study of the McMaster Family Assessment ratings by mothers and teachers on preschool chil-
Device in psychiatric, medical, and nonclinical dren using the Vineland Social Maturity Scale. Psy-
samples. Journal of Family Psychology, 3, 431– chology in the Schools, 13, 27–28.
439. Kaplan, R. (1977). Patterns of environmental prefer-
Kahana, E., Fairchild, T., & Kahana, B. (1982). Adap- ence. Environment and Behavior, 9, 195–216.
tation. In D. J. Mangen & W. A. Peterson (Eds.), Kaplan, R. M. (1982). Nader’s raid on the testing indus-
Research instruments in social gerontology (Vol. 1, try. American Psychologist, 37, 15–23.
pp. 145–193). Minneapolis, MN: University of Kardash, C. A., Amlund, J. T., & Stock, W. A. (1986).
Minnesota Press. Structural analysis of Paivio’s Individual Differences
Kahle, L. (1984). Attitudes and social adaptation. Questionnaire. Journal of Experimental Education,
Oxford, UK: Pergamon. 55, 33–38.
Kahn, M., Fox, H. M., & Rhode, R. (1988), Detecting Karnes, F. A., May, B., & Lee, L. A. (1982). Correlations
faking on the Rorschach: Computer vs. expert clin- between scores on Form A, Form B, and Form A+B
ical judgment. Journal of Personality Assessment, 52, of the Culture Fair Intelligence Test for economi-
516–523. cally disadvantaged students. Psychological Reports,
Kahn, M. W., Fox, H., & Rhode, R. (1990). Detecting 51, 417–418.
faking on the Rorschach: Computer versus expert Karr, S. K., Carvajal, H., & Palmer, B. L. (1992). Com-
clinical judgment. A reply to Cohen. Journal of Per- parison of Kaufman’s short form of the McCarthy
sonality Assessment, 54, 63–66. Scales of Children’s Abilities and the Stanford-Binet
Kahn, R. L., Goldfarb, A. I., Pollack, M., & Peck, A. Intelligence Scales – Fourth Edition. Perceptual and
(1960). Brief objective measures for the determina- Motor Skills, 74, 1120–1122.
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

578 References

Kasmar, J. V. (1970). The development of a usable lexi- Kaufman, A. S. (1983). Some questions and answers
con of environmental descriptors. Environment and about the Kaufman Assessment Battery for Children
Behavior, 2, 153–169. (K-ABC). Journal of Psychoeducational Assessment,
Kassin, S. M., & Wrightsman, L. S. (1983). The con- 1, 205–218.
struction and validation of a juror bias scale. Journal Kaufman, A. S. (1990). Assessing adolescent and adult
of Research in Personality, 17, 423–442. intelligence. Boston, MA: Allyn & Bacon.
Kaszniak, A. W. (1989). Psychological assessment of Kaufman, A. S. (2000). Intelligence tests and school
the aging individual. In J. E. Birren & K. W. psychology: predicting the future by studying the
Schaie (Eds.), Handbook of the psychology of aging, past. Psychology in the Schools, 37, 7–16.
(3rd ed., pp. 427–445). New York: Academic Kaufman, A. S., & Applegate, B. (1988). Short forms
Press. of the K-ABC Mental Processing and Achievement
Katz, E. (1955). Success on Stanford-Binet Intelligence scales at ages 4 to 121/2 years for clinical and screen-
Scale test items of children with cerebral palsy as ing purposes. Journal of Clinical Child Psychology,
compared with non-handicapped children. Cerebral 17, 359–369.
Palsy Review, 16, 18–19. Kaufman, A. S., & Hollenbeck, G. P. (1974). Compar-
Katz. J. N., Larson, M. G., Phillips, C. B., Fossel, A. H., ative structure of the WPPSI for blacks and whites.
& Liang, M. H. (1992). Comparative measurement Journal of Clinical Psychology, 30, 316–319.
sensitivity of short and longer health status instru- Kaufman, A. S., Ishikuma, T., & Kaufman-Packer, J. L.
ments. Medical Care, 30, 917–925. (1991). Amazingly short forms of the WAIS-R. Jour-
Katz, L., & Dalby, J. T. (1981). Computer and man- nal of Psychoeducational Assessment, 9, 4–15.
ual administration of the Eysenck Personality Inven- Kaufman, A. S., & Kaufman, N. L. (1975). Social class
tory. Journal of Clinical Psychology, 37, 586–588. differences on the McCarthy Scales for black and
Katz, S., Ford, A. B., Moskowitz, R. W., Jackson, B. A., & white children. Perceptual and Motor Skills, 41, 205–
Jaffe, M. W. (1963). The index of ADL: A standard- 206.
ized measure of biological and psychosocial func- Kaufman, A. S., & Kaufman, N. L. (1977). Clinical eval-
tion. Journal of the American Medical Association. uation of young children with the McCarthy Scales.
185, 914–919. New York: Grune & Stratton.
Kauffman, J. M. (1989). Characteristics of behavior dis- Kaufman, A. S., & Kaufman, N. L. (1983). K-ABC inter-
orders of children and youth (4th ed.). Columbus, pretive manual. Circle Pines, MN: American Guid-
OH: Merrill. ance Service.
Kaufman, A. S. (1972). A short form of the Wechsler Kaufman, G. (1983). How good are imagery
Preschool and Primnary Scale of Intelligence. Jour- questionnaires? A rejoinder to David Marks. Scan-
nal of Consulting and Clinical Psychology, 39, 311– dinavian Journal of Psychology, 24, 247–249.
369. Kazdin, A. E. (1977). Assessing the clinical or applied
Kaufman, A. S. (1973). The relationship of WPPSI importance of behavior change through social vali-
IQs to SES and other background variables. Journal dation. Behavior Modification, 1, 427–452.
of Clinical Psychology, 29, 354–357. Kearns, N. P., Cruickshank. C., McGuigan, K., Riley,
Kaufman, A. S. (1975). Factor structure of the S., Shaw, S., & Snaith, R. (1982). A comparison of
McCarthy Scales at five age levels between 21/2 and depression rating scales. British Journal of Psychiatry,
81/2 . Educational and Psychological Measurement, 35, 141, 45–49.
641–656. Keating, D. P., & MacLean, D. J. (1987). Cognitive pro-
Kaufman, A. S. (1977). A McCarthy short form for cessing, cognitive ability, and development: A recon-
rapid screening of preschool, kindergarten, and sideration. In P. A. Vernon (Ed.), Speed of informa-
first-grade children. Contemporary Educational Psy- tion processing and intelligence. New York: Ablex.
chology, 2, 149–157. Keesling, J. W. (1985). Review of USES General Apti-
Kaufman, A. S. (1979a). Intelligent testing with the tude Test Battery. In J. V. Mitchell, Jr. (Ed.), The ninth
WISC-R. New York: Wiley. mental measurements yearbook (Vol. 2, pp. 1645–
Kaufman, A. S. (1979b). WISC-R research: Impli- 1647). Lincoln, NE: University of Nebraska
cations for interpretation. The School Psychology Press.
Digest, 8, 5–27. Keith, T. (1985). McCarthy Scales of Children’s Abil-
Kaufman, A. S. (1982). An integrated review of ities. In D. Keyser & R. Sweetland (Eds.), Test
almost a decade of research on the McCarthy critiques: Vol. IV. Austin, TX: PRO-ED.
Scales. In T. R. Kratochwill (Ed.), Advances in school Keith, T. Z. (1990). Confirmatory and hierarchical con-
psychology (Vol. II, pp. 119–170). Hillsdale, NJ: firmatory analysis of the Differential Ability Scales.
Erlbaum. Journal of Psychoeducational Assessment, 8, 391–405.
P1: JZP
0521861810rfa2 CB1038/Domino 0 521 86181 0 March 6, 2006 13:35

References 579

Keith, T. Z., Cool, V. A., Novak, C. G., White, L. J., & Kidder, L. H., Judd, C. M., & Smith, E. R. (1986).
Pottebaum, S. M. (1988). Confirmatory factor anal- Research methods in social relations. New York: Holt,
ysis of the Stanford-Binet Fourth Edition: Testing Rinehart, & Winston.
the theory-test match. Journal of School Psychology, Kiecolt-Glaser, J. K., & Glaser, R. (1986). Psychological
26, 253–274. influences on immunity. Psychosomatics, 27, 621–
Kelleher, D. (1958). The social desirability factor in 625.
Edwards’ PPS. Journal of Consulting Psychology, 22, Kiernan, R. J., Mueller, J., Langston, J. W., & Van Dyke,
100. C. (1987). The Neurobehavioral Cognitive Status
Keller, M. (1972). The oddities of alcoholics. Quarterly Examination: A brief but differentiated approach to
Journal of Studies on Alcohol, 33, 1147–1148. cognitive assessment. Annals of Internal Medicine,
Kelley, M. L. (1985). Review of Child Behavior Check- 107, 481–485.
list. In J. V. Mitchell, Jr. (Ed.), The ninth mental mea- Kilpatrick, F. P., & Cantril, H. (1960). Self-anchoring
surements yearbook (Vol. 1, pp. 301–303). Lincoln, scaling: A measure of individual’s unique reality
NE: University of Nebraska Press. worlds. Journal of Individual Psychology, 16, 158–
Kelley, P. L., Jacobs, R. R., & Farr, J. L. (1994). 173.
Effects of multiple administrations of the MMPI Kilty, K. M., & Feld, A. (1976). Attitudes toward aging
for employee screening. Personnel Psychology, 47, and toward the needs of older people. Journal of
575–591. Gerontology, 31, 586–594.
Kelley, T. L. (1927). Interpretation of educational mea- King, J. D., & Smith, R. A. (1972). Abbreviated forms of
surements. New York: New World Book Company. the Wechsler Preschool and Primary Scale of Intel-
Kelley, T. L. (1928). Crossroads in the mind of man: A ligence for a kindergarten population. Psychological
study of differentiable mental abilities. Stanford, CA: Reports, 30, 539–542.
Stanford University Press. King, M., & King, J. (1971). Some correlates of univer-
Kelley, T. L. (1939). The selection of upper and lower sity performance in a developing country: The case
groups for the validation of test items. Journal of of Ethiopia. Journal of Cross-Cultural Psychology, 2,
Educational Psychology, 30, 17–24. 293–300.
Kelly, T. A. (1990). The role of values in psychotherapy: King, W. C., Jr., & Miles, E. W. (1995). A quasi-
A critical review of process and outcome effects. experimental assessment of the effect of computeriz-
Clinical Psychology Review, 10, 171–186. ing noncognitive paper-and-pencil measurements:
Kemp, D. E., & Stephens, J. H. (1971). Which AB A test of measurement equivalence. Journal of
scale? A comparative analysis of several versions. Applied Psychology, 80, 643–651.
The Journal of Nervous and Mental Disease, 152, Kirk, S. A., McCarthy, J. J., & Kirk, W. D. (1968). Illinois
23–30. Test of Psycholinguistic Abilities. Urbana, IL: Univer-
Keogh, B. K., & Smith, S. E. (1967). Visuo-Motor ability sity of Illinois Press.
for school prediction: A seven-year study. Perceptual Kirk Patrick, E. A. (1900) Individual tests of school
and Motor Skills, 25, 101–110. children. Psychological Review, 7, 274–280.
Kerlinger, F. N. (1964). Foundations of behavioural Kirnan, J. P., & Geisinger, K. F. (1981). The pre-
research. New York: Holt. diction of graduate school success in psychol-
Kerlinger, F. N. (1986). Foundations of behavioral ogy. Educational and Psychological Measurement, 41,
research (3rd ed.). New York: Holt, Rinehart and 815–820.
Winston. Kirnan, J. P., & Geisinger, K. F. (1990). General Apti-
Kerlinger, F. N., & Pedhazur, E. J. (1973). Multiple tude Test Battery. In J. Hogan & R. Hogan (Eds.),
current assessment tools for monitoring changes in review: Principles, procedures, and findings in the
depression. In M. S. Lambert, E. R. Christensen, & application of background data measures. Applied
S. S. DeJulio (Eds.), The assessment of psychotherapy Psychological Measurement, 11, 1–31.
outcome. (pp. 263–303). New York: Wiley. Mumley, D., L., Tillbrook, C. E., & Grisso, T. (2003).
Moreland, K. L. (1987). Computerized psychological Five year research update (1996–2000): Evaluations
assessment: What’s available. In J. N. Butcher (Ed.), for Competence to Stand Trial (Adjudicative Com-
Computerized psychological assessment (pp. 26–49). petence). Behavioral Sciences and the Law, 21, 329–
New York: Basic Books. 350.
Moreland, K. L. (1991). Assessment of validity Murphy, J. J. (1972). Current practices in the use of
in computer-based test interpretations. In T. B. psychological testing by police agencies. Journal of
Gutkin & S. L. Wise (Eds.), The computer and the Criminal Law, Criminology, and Police Science, 63,
decision-making process (pp. 43–74). Hillsdale, NJ: 570–576.
Erlbaum. Murphy, K. R. (1994). Potential effects of banding as a
Moreland, K. L., & Onstad, J. A. (1987). Validity of function of test reliability. Personnel Psychology, 47,
Millon’s computerized interpretation system for the 477–495.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 593

Murphy, K. R., & Constans, J. I. (1987). Behavioral ism and jurors’ perceptions of defendant culpability.
anchors as a source of bias in rating. Journal of Journal of Applied Psychology, 78, 34–42.
Applied Psychology, 72, 573–577. Nathan, B. R., & Alexander, R. A. (1988). A compar-
Murphy, K. R., Jako, R. A., & Anhalt, R. L. (1993). ison of criteria for test validation: A metaanalytic
Nature and consequences of halo error: A critical investigation. Personnel Psychology, 41, 517–535.
analysis. Journal of Applied Psychology, 78, 218–225. Nathan, B. R., & Tippins, N. (1990). The consequences
Murray, H. A. (1938). Explorations in personality. New of halo “error” in performance ratings: A field study
York: Oxford University Press. of the moderating effect of halo on test validation
Murray, H. A. (1943). Thematic Apperception Test: results. Journal of Applied Psychology, 75, 290–296.
Pictures and manual. Cambridge, MA: Harvard National Commission on Excellence in Education
University Press. (1983). A nation at risk: The imperative for educa-
Murray, H. A. (1967). Autobiography. In E. G. Bor- tional reform. Washington, DC: U.S. Government
ing & G. Lindzey (Eds.), A history of psychology in Printing Office.
autobiography (Vol. 5, pp. 283–310). New York: National Society to Prevent Blindness (1974). The Sym-
Appleton-Century-Crofts. bol Chart for Twenty Feet – Snellen Scale. New York:
Murray, J. B. (1990). Review of research on the Myers- Author.
Briggs Type Indicator. Perceptual and Motor Skills, Naughton, M. J., & Wiklund, I. (1993). A critical review
70, 1187–1202. of dimension-specific measures of health-related
Murstein, B. J. (1963). Theory and research in projective quality of life in cross-cultural research. Quality of
techniques (emphasizing the TAT). New York: Wiley. Life Research, 2, 397–432.
Myers, A. M., Holliday, P. J., Harvey, K. A., & Hutchin- Nedelsky, L. (1954). Absolute grading standards for
son, K. S. (1993). Functional performance measures: objective tests. Educational and Psychological Mea-
Are they superior to self-assessments? Journal of surement, 14, 3–19.
Gerontology, 48, 196–206. Needham, W. E., & Eldridge, L. S. (1990). Performance
Myers, I. B., & McCaulley, M. H. (1985). Manual: A of blind vocational rehabilitation clients on the Min-
guide to the development and use of the Myers-Briggs nesota Rate of Manipulation Tests. Journal of Visual
Type Indicator. Palo Alto, CA: Consulting Psycholo- Impairment and Blindness, 84, 182–185.
gists Press. Neeper, R., & Lahey, B. B. (1984). Identification of two
Myklebust, H. (1964). The psychology of deafness. New dimensions of cognitive deficits through the factor
York: Grune & Stratton. analysis of teacher ratings. School Psychology Review,
Nagle, R. J. (1979). The McCarthy Scales of Children’s 13, 485–490.
Abilities: Research implications for the assessment Neiner, A. G., & Owens, W. A. (1985). Using biodata to
of young children. School Psychology Digest, 8, 319– predict job choice among college graduates. Journal
326. of Applied Psychology, 70, 127–136.
Nagle, R. J., & Bell, N. L. (1995). Validation of an item- Neisser, U. (1976). General, academic, and artificial
reduction short form of the Stanford-Binet Intelli- intelligence. In L. Resnick (Ed.), The nature of intel-
gence Scale: Fourth edition with college students. ligence (pp. 135–144). Hillsdale, NJ: Erlbaum.
Journal of Clinical Psychology, 51, 63–70. Neisser, U., Boodoo, G., Bouchard, T. J., Jr., & Boykin,
Naglieri, J. A. (1981). Factor structure of the WISC-R A. W. (1996). Intelligence: Knowns and unknowns.
for children identified as learning disabled. Psycho- American Psychologist, 51, 77–101.
logical Reports, 49, 891–895. Neisworth, J. T., & Bagnato, S. J. (1986). Curriculum-
Naglieri, J. A. (1988). Draw A Person: A quantitative based developmental assessment: Congruence of
scoring system. San Antonio, TX: Psychological Cor- testing and teaching. School Psychology Review, 15,
poration. 180–199.
Naglieri, J. A., & Kaufman, A. S. (1983). How many Nelson, R. O. (1977). Methodological issues in assess-
factors underlie the WAIS-R? The Journal of Psy- ment via self-monitoring. In J. D. Cone & R. P.
choeducational Assessment, 1, 113–120. Hawkins (Eds.), Behavioral assessment: New direc-
Naglieri, J. A., & Welch, J. A. (1991). Use of Raven’s tions in clinical psychology (pp. 217–240). New York:
and Naglieri’s Noverbal Matrix Tests. Journal of the Brunner/Mazel.
American Deafness and Rehabilitation Association, Nelson, R. O., & Hayes, S. C. (1979). Some current
24, 98–103. dimensions of behavioral assessment. Behavioral
Nairn, A. and associates (1980). The reign of ETS. Assessment, 1, 1–16.
Washington DC: Ralph Nader. Nelson-Gray, R. O. (1991). DSM IV: Empirical guide-
Narby, D. J., Cutler, B. L., & Moran, G. (1993). A meta- lines for psychometrics. Journal of Abnormal Psy-
analysis of the association between authoritarian- chology, 100, 308–315.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

594 References

Nester, M. A. (1984). Employment testing for handi- (Eds.), The computer and the decision-making process
capped persons. Public Personnel Management Jour- (pp. 177–197). Hillsdale, NJ: Erlbaum.
nal, 13, 417–434. Norman, R. D., & Daley, M. F. (1959). Senescent
Nester, M. A. (1993). Psychometric testing and rea- changes in intellectual ability among superior older
sonable accomodation for persons with disabilities. women. Journal of Gerontology, 14, 457–464.
Rehabilitation Psychology, 38, 75–85. Norman, W. T. (1963). Toward an adequate taxonomy
Neugarten, B. (1974). Age groups in the young-old. of personality attributes: Replicated factor structure
Annals of the American Academy of Political and in peer nomination personality ratings. Journal of
Social Science, 415, 187–198. Abnormal and Social Psychology, 66, 574–583.
Neugarten, B. L., & Associates (Eds.). (1964). Person- Norring, C., & Sohlberg, S. (1988). Eating Disorder
ality in middle and late life. New York: Atherton. Inventory in Sweden: Description, cross-cultural
Neugarten, B. L., Havighurst, R. J., & Tobin, S. S. comparison, and clinical utility. Acta Psychiatrica
(1961). The measurement of life satisfaction. Jour- Scandinavica, 78, 567–575.
nal of Gerontology, 16, 134–143. Norton, R. (1983). Measuring marital quality: A crit-
Nevo, B. (1985). Face validity revisited. Journal of Edu- ical look at the dependent variable. Journal of Mar-
cational Measurement, 22, 287–293. riage and the Family, 45, 141–151.
Newcomb, T. M. (1950). Social psychology. New York: Nowacek, G. A., Pullen, E., Short, J., & Blumner, H. N.
Holt. (1987). Validity of MCAT scores as predictors of
Newman, J. P. (1989). Aging and depression. Psychol- preclinical grades and NBME Part I examination
ogy and aging, 4, 150–165. scores. Journal of Medical Education, 62, 989–991.
Ng, S. H., Akhtar Hossain, A. B. M., Ball, P., Bond, Nowicki, S., & Duke, M. P. (1974). A locus of control
M. H., Hayashi, K., Lim, S. P., O’Driscoll, M. P., scale for non-college as well as college adults. Journal
Sinha, D., & Yang, K. S. (1982). Human values in of Personality Assessment, 38, 136–137.
nine countries. In R. Rath, H. S. Asthana, D. Sinha, Nowicki, S., & Strickland, B. R. (1973). A locus of con-
& J. B. P. Sinha (Eds.), Diversity and unity in cross- trol scale for children. Journal of Consulting Psychol-
cultural psychology. Lissie, the Netherlands: Swets & ogy, 40, 148–154.
Zeitlinger. Nunnally, J. (1962). The analysis of profile data. Psy-
Nichols, R. C., & Schnell, R. R. (1963). Factor scales chological Bulletin, 59, 311–319.
for the California Psychological Inventory. Journal Nunnally, J., & Husek, T. R. (1958). The phony lan-
of Consulting Psychology, 27, 228–235. guage examination: An approach to the measure-
Nicholson, C. L. (1977). Correlations between ment of response bias. Educational and Psychological
the Quick Test and the Wechsler Scale for Measurement, 2, 275–282.
Children-Revised. Psychological Reports, 40, 523– Nussbaum, K., Wittig, B. A., Hanlon, T. E., & Kurland,
526. A. A. (1963). Intravenous mialamide in the treat-
Nicholson, R. A., Briggs, S. R., & Robertson, H. C. ment of depressed female patients. Comprehensive
(1988). Instruments for assessing competency to Psychiatry, 4, 105–116.
stand trial: How do they work? Professional Psychol- Nuttall, E. V., Romero, I., & Kalesnik, J. (Eds.). (1992).
ogy, 19, 383–394. Assessing and screening preschoolers. Boston, MA:
Nietzel, M. T., Martorano, R. D., & Melnick, J. (1977). Allyn & Bacon.
The effects of covert modeling with and without Oakland, T. (1985a). Review of the Otis-Lennon
reply training on the development and generaliza- School Ability Test. In J. V. Mitchell, Jr. (Ed.),
tion of assertive responses. Behavior Therapy, 8, 183– The ninth mental measurements yearbook (Vol.
192. 2, pp. 1111–1112). Lincoln, NE: University of
Nihira, K., Foster, R., Shellhass, M., & Leland, H. Nebraska Press.
(1974). AAMD Adaptive Behavior Scale. Washing- Oakland, T. (1985b). Review of the Slosson Intelligence
ton, DC: American Association of Mental Defi- Test. In J. V. Mitchell, Jr. (Ed.), The ninth men-
ciency. tal measurements yearbook (Vol. 2, pp. 1401–1403).
Nolan, D. R., Hameke, T. A., & Barkley, R. A. (1983). A Lincoln, NE: University of Nebraska Press.
comparison of the patterns of the neuropsychologi- Oakland, T., & Matuszek, P. (1977). Using tests in
cal performance in two groups of learning-disabled nondiscriminatory assessment. In T. Oakland (Ed.),
children. Journal of Clinical Child Psychology, 12, 13– Psychological and educational assessment in minority
21. group children. New York: Bruner/Mazel.
Noonan, J. V., & Sarvela, P. D. (1991). Implementa- Obayuwana, A. O., Collins, J. L., Carter, A. L., Rao,
tion decisions in designing computer-based instruc- M. S., Mathura, C. C., & Wilson, S. B. (1982).
tional testing programs. In T. B. Gutkin & S. L. Wise Hope Index Scale: An instrument for the objective
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 595

assessment of hope. Journal of the National Medical uate program admission. ETS Research Report,
Association of New York, 74, 761. 84–14.
Oblowitz, N., Green, L., & Heyns, I. (1991). A self- Ones, D. S., Chockalingam, V., & Schmidt, F. L. (1995).
concept scale for the hearing-impaired. The Volta Integrity tests: Overlooked facts, resolved issues,
Review, 93, 19–29. and remaining questions. American Psychologist, 50,
O’Brien, N. P. (1988). Test construction. New York: 456–457.
Greenwood Press. Ones, D. S., Viswesvaran, C., & Schmidt, F. L.
O’Dell, J. (1972). P. T. Bamum explores the computer. (1993). Comprehensive metaanalysis of integrity
Journal of Consulting and Clinical Psychology, 38, test validities: Findings and implications for per-
270–273. sonnel selection and theories of job performance.
O’Donnell, J. M. (1979). The clinical psychology of Journal of Applied Psychology, 78, 679–703.
Lightner Witmer: A case study of institutional inno- Ong, J., & Marchbanks, R. L. (1973). Validity of selected
vation and intellectual change. Journal of the History academic and non-academic predictors of optome-
of the Behavioral Sciences, 15, 3–17. try grades. American Journal of Optimetry, 50, 583–
O’Donohue, W., & Letourneau, E. (1992). The psycho- 588.
metric properties of the penile tumescence assess- Oosterhof, A. C. (1976). Similarity of various item dis-
ment of child molesters. Journal of Psychopathology crimination indices. Journal of Educational Measure-
and Behavioral Assessment, 14, 123–174. ment, 13, 145–150.
Olbrisch, M. E. (1983). Development and validation Oppenheim, A. N. (1992). Questionnaire design, inter-
of the Ostomy Adjustment Scale. Rehabilitation Psy- viewing and attitude measurement. London: Pinter.
chology, 28, 3–12. Ortiz, V. Z., & Gonzalez, A. (1989). Validation of a short
Olea, M. M., & Ree, M. J. (1994). Predicting pilot and form of the WISC-R with accelerated and gifted
navigator criteria: Not much more than g. Journal Hispanic students. Gifted Child Quarterly, 33, 152–
of Applied Psychology, 79, 845–851. 155.
Oliver, J. M., & Burkham, R. (1979). Depression in Ortiz, V. Z., & Volkoff, W. (1987). Identification of
university students: Duration, relation to calendar gifted and accelerated Hispanic students. Journal for
time, prevalence, and demographic correlates. Jour- the Education of the Gifted, 11, 45–55.
nal of Abnormal Psychology, 88, 667–670. Orvik, J. M. (1972). Social desirability for the individ-
Oliveri. M. E., & Reiss, D. (1984). Family concepts and ual, his group, and society. Multivariate Behavioral
their measurement; Things are seldom what they Research, 7, 3–32.
seem. Family Process, 23, 33–48. Osborne, R. T. (1994). The Burt collection. Journal of
Ollendick, T. H. (1979). Discrepancies between Ver- the History of the Behavioral Sciences, 30, 369–379.
bal and Performance IQs and subtest scatter on Osgood, C., & Luria, Z. (1954). A blind analysis of a
the WISC-R for juvenile delinquents. Psychological case of multiple personality using the Semantic Dif-
Reports, 45, 563–568. ferential. Journal of Abnormal and Social Psychology,
Olmedo, E. L. (1979). Acculturation. A psychometric 49, 579–591.
perspective. American Psychologist, 34, 1061–1070. Osgood, C., Suci, G., & Tannenbaum, P. (1957). The
Olmedo, E. L. (1981). Testing linguistic minorities. measurement of meaning. Urbana, IL: University of
American Psychologist, 36, 1078–1085. Illinois Press.
Olmedo, E. L., Martinez, J. L., & Martinez, S. R. (1978). Osipov, V. P. (1944). Malingering: The simulation of
Measure of acculturation for Chicano adolescents. psychosis. Bulletin of the Meninger Clinic, 8, 39–42.
Psychological Reports, 42, 159–170. Osipow, S. H. (1983). Theories of career development
Olmedo, E. L., & Padilla, A. M. (1978). Empirical and (3rd ed.). New York: Appleton-Century-Crofts.
construct validation of a measure of acculturation, Osipow, S. H. (Ed.). (1987). Manual for the Career
for Mexican-Americans. Journal of Social Psychology, Decision Scale. Odessa, FL: Psychological Assess-
105, 179–187. ment Resources.
Olson, D. H., McCubbin, H. I, Barnes, H., Larsen, A., Oskamp, S. (1991). Attitudes and opinions (2nd ed.).
Muxen, M., & Wilson, M., & Wilson, M. (1985). Englewood Cliffs, NJ: Prentice Hall.
Family inventories. St. Paul, MN: University of Min- OSS Assessment Staff (1948). Assessment of men: Selec-
nesota. tion of personnel for the Office of Strategic Services.
Olson, D. H., & Rabunsky, C. (1972). Validity of four New York: Rinehart.
measures of family power. Journal of Marriage and Osterhouse, R. A. (1972). Desensitization and study-
the Family, 34, 224–234. skills training as treatment for two types of test-
Oltman, P. K., & Hartnett, R. T. (1984). The role anxious students. Journal of Counseling Psychology,
of GRE General and Subject Test scores in grad- 19, 301–307.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

596 References

Osterlind, S. J. (1983). Test item bias. Newbury Park, and skills. Casnadian Journal of Psychology, 37, 461–
CA: Sage. 483.
Osterlind, S. J. (1989). Constructing test items. Boston, Pankratz, L., Fausti, S., & Peed, S. (1975). A forced-
MA: Kluwer Academic. choice technique to evaluate deafness in the hysteri-
Ostrom, T. M. (1969). The relationship between the cal or malingering patient. Journal of Consulting and
affective, behavioral, and cognitive components of Clinical Psychology, 43, 421–422.
attitude. Journal of Experimental Social Psychology, Panton, J. H. (1958). Predicting prison adjustment
5, 12–30. with the MMPI. Journal of Clinical Psychology, 14,
Ouellette, S. E. (1988). The use of projective drawing 308–312.
techniques in the personality assessment of prelin- Parker, K. (1983). A meta-analysis of the reliability
gually deafened young adults: A pilot study. Ameri- and validity of the Rorschach. Journal of Personality
can Annals of the Deaf, 133, 212–218. Assessment, 47, 227–231.
Overall, J. E. (1974). Validity of the Psychological Parrott, C. A. (1986). Validation report on the
Screening Inventory for psychiatric screening. Jour- Verbalizer-Visualizer Questionnaire. Journal of
nal of Consulting and Clinical Psychology, 42, 717– Mental Imagery, 10, 39–42.
719. Pascal, G. R., & Suttell, B. J. (1951). The Bender-Gestalt
Overall, J. E., & Magee, K. N. (1992). Estimating indi- Test: Its quantification and validity for adults. New
vidual rater reliabilities. Applied Psychological Mea- York: Grune & Stratton.
surement, 16, 77–85. Passow, H. (1985). Review of School and College
Owen, D. (1985). None of the above. Boston, MA: Ability Tests, Series III. In J. V. Mitchell, Jr. (Ed.),
Houghton Mifflin. The ninth mental measurements yearbook (Vol. 2, pp.
Owen, S. V., & Baum, S. M. (1985). The validity of the 1317–1318). Lincoln, NE: University of Nebraska
measurement of originality. Educational and Psy- Press.
chological Measurement, 45, 939–944. Pastore, N. (1978). The Army intelligence tests and
Owens, W. A. (1976). Background data. In M. D. Dun- Walter Lippmann. Journal of the History of the Behav-
nette (Ed.), Handbook of industrial psychology. New ioral Sciences, 14, 316–327.
York: Rand-McNally. Patterson, C. H. (1946). A comparison of various
Owens, W. A., Glennon, J. R., & Albright, L. E. (1962). “short forms” of the Wechsler-Bellevue Scale. Jour-
Retest consistency and the writing of life history nal of Consulting Psychology, 10, 260–267.
items: A first step. Journal of Applied Psychology, 46, Paulhus, D. L. (1984). Two-component models of
329–332. socially desirable responding. Journal of Personality
Ownby, R. L., & Carmin, C. N. (1988). Confirma- and Social Psychology, 46, 598–609.
tory factor analyses of the Stanford-Binet Intelli- Paulhus, D. L. (1986). Self-deception and impres-
gence Scale (4th ed.). Journal of Psychoeducational sion management in test responses. In A. Angleit-
Assessment, 6, 331–340. ner & J. S. Wiggins (Eds.), Personality assessment
Pace, C. R., & Stern, G. G. (1958). An approach to via questionnaires (pp. 143–165). Berlin: Springer-
the measurement of psychological characteristics of Verlag.
college environments. Journal of Educational Psy- Paulhus, D. L. (1991). Measurement and control of
chology, 49, 269–277. response bias. In J. P. Robinson, P. R. Shaver, &
Padilla, A. (1980). The role of cultural awareness and L. S. Wrightsman (Eds.), Measures of personality and
ethnic loyalty in acculturation. In A. Padilla (Ed.), social psychological attitudes (pp. 17–59). New York:
Acculturation: Theory, models, and some new find- Academic Press.
ings. Boulder, CO: Westview Press. Payne, S. L. (1951). The art of asking questions. Prince-
Padilla, E. R., Olmedo, E. L., & Loya, R. (1982). Accul- ton, NJ: Princeton University Press.
turation and the MMPI performance of Chicano Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980).
and Anglo college students. Hispanic Journal of Validity generalization results for tests used to pre-
Behavioral Sciences, 4, 451–466. dict job proficiency and training success in clerical
Paget, K. D., & Nagle, R. J. (1986). A conceptual model occupations. Journal of Applied Psychology, 65, 373–
of preschool assessment. School Psychology Review, 406.
15, 154–165. Pearson, B. Z. (1993). Predictive validity of the Scholas-
Paivio, A. (1971). Imagery and verbal processes. New tic Aptitude Test (SAT) for Hispanic bilingual stu-
York: Holt, Rinehart and Winston. dents. Hispanic Journal of Behavioral Sciences, 15,
Paivio, A. (1975). Imagery and synchronic thinking. 342–356.
Canadian Psychological Review, 16, 147–163. Pederson, K. (1990). Candidate Profile Record. In J.
Paivio, A., & Harshman, R. (1983). Factor analysis Hogan & R. Hogan (Eds.), Business and industry
of a questionnaire on imagery and verbal habits testing (pp. 357–359). Austin, TX: Pro-ed.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 597

Pedhazur, E. (1982). Multiple regression in behavioral Plenum series in social/clinical psychology. New
research: Explanation and prediction (2nd ed.). New York: Plenum.
York: Holt, Rinehart and Winston. Piers, E., & Harris, D. (1969). Manual for the Piers-
Penner, L. A., Homant, R., & Rokeach, M. (1968). Harris Children’s Self-concept Scale. Nashville, TN:
Comparison of rank-order and paired-comparison Counselor Recordings and Tests.
methods for measuring value systems. Perceptual Piersma, H. L. (1987). Millon Clinical Multiaxial
and Motor Skills, 27, 417–418. Inventory (MCMI) computer-generated diagnoses:
Pennock-Roman, M. (1990). Test validity and language How do they compare to clinician judgment? Jour-
background: A study of Hispanic American students at nal of Psychopathology and Behavioral Assessment, 9,
six universities. New York: College Entrance Exami- 305–312.
nation Board. Pintner, R. (1924). Results obtained with the non-
Peplau, L. A., & Perlman, D. (Eds.). (1982). Loneliness: language group tests. Journal of Educational Psychol-
A sourcebook of current theory, research and therapy. ogy, 15, 473–483.
New York: Wiley. Pintner, R., & Paterson, D. G. (1915). The Binet Scale
Perry, G. G., & Kinder, B. N. (1990). The suscepti- and the deaf child. Journal of Educational Psychology,
bility of the Rorschach to malingering: A critical 6, 201–210.
review. Journal of Personality Assessment, 54, 47– Piotrowski, C. (1983). Factor structure on the Seman-
57. tic Differential as a function of method of analy-
Petersen, N. S., & Novick, M. R. (1976). An evaluation sis. Educational and Psychological Measurement, 43,
of some models for culture-fair selection. Journal of 283–288.
Educational Measurement, 13, 3–29. Piotrowski, C. (1995). A review of the clinical and
Peterson, J. (1926). Early conceptions and tests of intel- research use of the Bender-Gestalt Test. Perceptual
ligence. Yonkers, NY: World Book Company. and Motor Skills, 81, 1272–1274.
Petrie, B. M. (1969). Statistical analysis of attitude scale Piotrowski, C., & Keller, J. W. (1984). Psychodiagnostic
scores. Research Quarterly, 40, 434–437. testing in APA-approved clinical psychology pro-
Pfeiffer, E. (1975). SPMSQ: Short Portable Mental Sta- grams. Professional Psychology: Research and Prac-
tus Questionnaire. Journal of the American Geriatric tice, 15, 450–456.
Society, 23, 433–441. Piotrowski, Z. A. (1964). Digital-computer interpre-
Phares, E. J. (1976). Locus of control in personality. Mor- tation of inkblot test data. Psychiatric Quarterly, 38,
ristown, NJ: General Learning Press. 1–26.
Phelps, L. (1989). Comparison of scores for intel- Plag, J. A., & Goffman, J. M. (1967). The Armed Forces
lectually gifted students on the WISC-R and the Qualification Test: Its validity in predicting military
fourth edition of the Stanford-Binet. Psychology in effectiveness for naval enlistees. Personnel Psychol-
the Schools, 26, 125–129. ogy, 20, 323–340.
Phelps, L., & Bell, M. C. (1988). Correlations between Plake, B. S., Reynolds, C. R., & Gutkin, T. B. (1981).
the Stanford-Binet: Fourth Edition and the WISC-R A technique for comparison of the profile variabil-
with a learning disabled population. Psychology in ity between independent groups. Journal of Clinical
the Schools, 25, 380–382. Psychology, 37, 142–146.
Phelps, L., & Branyan, L. T. (1988). Correlations among Plemons, G. (1977). A comparison of MMPI scores of
the Hiskey, K-ABC Nonverbal Scale, Leiter, and Anglo-and Mexican-American psychiatric patients.
WISC-R Performance Scale with public-school deaf Journal of Consulting and Clinical Psychology, 45,
children. Journal of Psychoeducational Assessment, 6, 149–150.
354–358. Pollard, W. E., Bobbitt, R. A., Bergner, M., Martin, D. P.,
Phillips, B. L., Pasewark, R. A., & Tindall, R. C. & Gilson, B. S. (1976). The Sickness Impact Profile:
(1978). Relationship among McCarthy Scales of Reliability of a health status measure. Medical Care,
Children’s Abilities, WPPSI, and Columbia Men- 14, 146–155.
tal Maturity Scale. Psychology in the Schools, 15, Ponterotto, J. G., Pace, T. M., & Kavan, M. G. (1989).
352–356. A counselor’s guide to the assessment of depres-
Phillips, S. (1980). Children’s perceptions of health and sion. Journal of Counseling and Development, 67,
disease. Canadian Family Physician, 26, 1171–1174. 301–309.
Piaget, J. (1952). The origins of intelligence in children. Pope, K. S. (1992). Responsibilities in providing psy-
New York: International Universities Press. chological test feedback to clients. Psychological
Piaget, J. (1967). Six psychological studies. New York: Assessment, 4, 268–271.
Random House. Popham, W. J. (1981). The case for minimum
Piedmont, R. L. (1998). The revised NEO Personal- competency testing. Phi Delta Kappan, 63, 89–
ity Inventory: Clinical and research applications. The 91.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

598 References

Poresky, R. H., Hendrix, C., Mosier, J. E., & Samuelson, Prout, H. T., & Schwartz, J. F. (1984). Validity of the
M. L. (1988). The Companion Animal Seman- Peabody Picture Vocabulary Test-Revised with men-
tic Differential: Long and short form reliability tally retarded adults. Journal of Clinical Psychology,
and validity. Educational and Psychological Measure- 40, 584–587.
ment, 48, 255–260. Pulliam, G. P. (1975). Social desirability and the
Porter, T. M. (1986). The rise of statistical think- Psychological Screening Inventory. Psychological
ing, 1820–1900. Princeton, NJ: Princeton University Reports, 36, 522.
Press. Quay, H. C., & Peterson, D. R. (1967). Manual for the
Poteat, G. M., Wuensch, K. L., & Gregg, N. B. (1988). Behavior Problem Checklist. Champaign, IL: Chil-
An investigation of differential prediction with the dren’s Research Center, University of Illinois.
WISC-R. Journal of School Psychology, 26, 59–68. Quay, L. C. (1974). Language dialect, age, and
Powers, D. E. (1993). Coaching for the SAT: A sum- intelligence-test performance in disadvantaged
mary of the summaries and an update. Educational black children. Child Development 45, 463–468.
Measurement: Issues and Practice, 12, 24–39. Quenk, N. L. (2000). Essentials of Myers-Briggs Type
Powers, D. E., & Alderman, D. L. (1983). Effect of test Indicator Assessment. New York: Wiley.
familiarization on SAT performance. Journal of Edu- Rabideau, G. F. (1955). Differences in visual acuity
cational Measurement, 20, 71–79. measurements obtained with different types of tar-
Powers, S., & Barkan, J. H. (1986). Concurrent validity gets. Psychological Monographs, 69, No. 10 (Whole
of the Standard Progressive Matrices for Hispanic No. 395).
and non-Hispanic seventh-grade students. Psychol- Rabkin, J. G., & Struening, E. L. (1976). Life events,
ogy in the Schools, 23, 333–336. stress, and illness. Science, 194, 1013–1020.
Pratkanis, A. R., & Greenwald, A. G. (1989). A Radloff, L. S. (1977). The CES-D Scale: A self-report
sociocognitive model of attitude structure and func- depression scale for research in the general popu-
tion. In L. Berkowitz (Ed.), Advances in experimental lation. Applied Psychological Measurement, 1, 385–
social psychology (Vol. 22, pp. 245–285). San Diego, 401.
CA: Academic Press. Radloff, L. S., & Teri, L. (1986). Use of the Center
Praver, F., DiGiuseppe, R., Pelcovitz, D., Mandel, F. S., for Epidemiological Studies-Depression Scale with
& Gaines, R. (2000). A preliminary study of a car- older adults. Clinical Gerontologist, 5, 119–136.
toon measure for children’s reactions to chronic Ragosta, M., & Nemceff, W. (1982). A research and
trauma. Child Maltreatment, 5, 273–285. development program on testing handicapped people
Preston, J. (1978). Abbreviated forms of the WISC-R. (RM 82-2). Princeton, NJ: Educational Testing Ser-
Psychological Reports, 42, 883–887. vice.
Prieto, E. J., & Geisinger, K. F. (1983). Factor-analytic Rahe, R. H., & Lind, E. (1971). Psychosocial factors
studies of the McGill Pain Questionnaire. In R. and sudden cardiac death: A pilot study. Journal of
Melzack (Ed.), Pain measurement and assessment Psychosomatic Research, 15, 19–24.
(pp. 85–93). New York: Raven Press. Rahe, R. H., Lundberg, V., Bennett, L., & Theorell, T.
Prigatano, G. P. (1978). Wechsler Memory Scale: A (1971). The social readjustment rating scale: A com-
selective review of the literature. Journal of Clinical parative study of Swedes and Americans. Journal of
Psychology, 34, 816–832. Psychosomatic Research, 15, 241–249.
Prigatano, G. P., & Amin, K. (1993). Digit Memory Rahe, R. H., Meyer, M., Smith, M., Kjaer, G., & Holmes,
Test: Unequivocal cerebral dysfunction and sus- T. H. (1964). Social stress and illness onset. Journal
pected malingering. Journal of Clinical and Exper- of Psychosomatic Research, 8, 35–44.
imental Neuropsychology, 15, 537–546. Raine, A. (1991). The SPQ: A scale for the assessment
Prigatano, G. P., & Redner, J. E. (1993). Uses and abuses of schizotypal personality based on DSM-III-R cri-
of neuropsychological testing in behavioral neurol- teria. Schizophrenia Bulletin, 17, 555–564.
ogy. Neurologic Clinics, 11, 219–231. Rajecki. D. W. (1990). Attitudes. Sunderland, MA: Sin-
Pritchard, D. A., & Rosenblatt, A. (1980). Racial bias auer.
in the MMPI: A methodological review. Journal of Ralston, D. A., Gustafson, D. J., Elsass, P. M., Che-
Consulting and Clinical Psychology, 48, 263–267. ung, F., & Terpstra, R. H. (1992). Eastern values: A
Prout, H. T., & Ferber, S. M. (1988). Analogue comparison of managers in the United States, Hong
assessment: Traditional personality assessment Kong, and the People’s Republic of China. Journal
measures in behavioral assessment. In E. S. Shapiro of Applied Psychology, 77, 664–671.
& T. R. Kratochwill (Eds.), Behavioral assessment in Ramanaiah, N. V., & Adams, M. L. (1979). Confirma-
schools: Conceptual foundations and practical appli- tory factor analysis of the WAIS and the WPPSI.
cations (pp. 322–350). New York: Guilford Press. Psychological Reports, 45, 351–355.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 599

Ramanaiah, N. V., & Martin, H. J. (1980). On the two- Reardon, R. C., Hersen, M., Bellack, A. S., & Foley,
dimensional nature of the Marlowe-Crowne Social J. M. (1979). Measuring social skill in grade school
Desirability Scale. Journal of Personality Assessment, boys. Journal of Behavioral Assessment, 1, 87–105.
44, 507–514. Ree, M. J., & Earles, J. A. (1991). Predictive training
Ramirez, M., Garza, R. T., & Cox, B. G. (1980). Mul- success: Not much more than g. Personnel Psychol-
ticultural leader behaviors in ethnically mixed task ogy, 44, 321–332.
groups (Technical Report). Office of Naval Research: Ree, M. J., Earles, J. A., & Teachout, M. (1994). Pre-
Organizational Effectiveness Research Program. dicting job performance: Not much more than g.
Randt, C. T., Brown, E. R., & Osborne, D. P., Jr. (1980). Journal of Applied Psychology, 79, 518–524.
A memory test for longitudinal measurement of Reed, M. (1970). Deaf and partially hearing chil-
mild to moderate deficits. Clinical Neuropsychology, dren. In P. Mittler (Ed.), The psychological assess-
2, 184–194. ment of mental and physical handicaps. London:
Rankin, W. L., & Grobe, J. W. (1980). A comparison Tavistock.
of ranking and rating procedures for value system Reed, P. F., Fitts, W. H., & Boehm, L. (1981). Tennesseee
measurement. European Journal of Social Psychology, Self-Concept Scale: Bibliography of research studies
10, 233–246. (Rev. ed.). Los Angeles, CA: Western Psychological
Raphael, E. G., Cloitre, M., & Dohrenwend, B. P. Services.
(1991). Problems of recall and misclassification with Rehfisch, J. M. (1958). A scale for personal rigidity.
checklist methods of measuring stressful life events. Journal of Consulting Psychology, 22, 11–15.
Health Psychology, 10, 62–74. Rehfisch, J. M. (1959). Some scale and test correlates
Rasch, G. (1966). An item-analysis which takes of a Personality Rigidity scale. Journal of Consulting
individual differences into account. British Jour- Psychology, 22, 372–374.
nal of Mathematical and Statistical Psychology, 19, Reich, J. H. (1987). Instruments measuring DSM-III
49–57. and DSM-III-R personality disorders. Journal of Per-
Rathus, S. A. (1973). A 30-item schedule for assessing sonality Disorders, 1, 220–240.
assertive behavior. Behavior Therapy, 4, 398–406. Reich, J. H. (1989). Update on instruments to mea-
Rathus, S. A., & Nevid, J. S. (1977). Concurrent validity sure DSM-III and DSM-III-R personality disorders.
of the 30-item Assertiveness Schedule with a psychi- Journal of Nervous and Mental Disease, 177, 366–370.
atric population. Behavior therapy, 8, 393–397. Reich, T., Robins, L. E., Woodruff, R. A., Jr., Taibleson,
Raven, J. C. (1938). Standard Progressive Matrices. M., Rich, C., & Cunningham, L. (1975). Computer-
London: H. K. Lewis. assisted derivation of a screening interview for alco-
Raven, J. C. (1947a). Coloured Progressive Matrices. holism. Archives of General Psychiatry, 32, 847–852.
London: H. K. Lewis. Reilly, R. R., & Chao, G. T. (1982). Validity and fairness
Raven, J. C. (1947b). Advanced Progressive Matrices. of some alternative selection procedures. Personnel
London: H. K. Lewis. Psychology, 35, 1–62.
Raven, J. C., Court, J. H., & Raven, J. (1977). Coloured Reilly, R. R., Henry, S., & Smither, J. W. (1990). An
Progressive Matrices. London: Lewis. examination of the effects of using behavior check-
Raven, J., Raven, J. C., & Court, J. H. (1998). Standard lists on the construct validity of assessment center
Progressive Matrices. 1998 edition. Oxford: Oxford dimensions. Personnel Psychology, 43, 71–84.
University Press. Reilly, R. R., & Knight, G. E. (1970). MMPI scores
Rawls, J. R., Rawls, D. J., & Harrison, C. W. (1969). of Mexican-American college students. Journal of
An investigation of success predictors in graduate College Student Personnel, 11, 419–422.
school in psychology. Journal of Psychology, 72, 125– Reissland, N. (1983). Cognitive maturity and the expe-
129. rience of fear and pain in the hospital. Social Science
Ray, J. J. (1971). A new measure of conservatism: Its Medicine, 17, 1389–1395.
limitations. British Journal of Social and Clinical Psy- Reitan, R. M. (1969). Manual for administration of neu-
chology, 10, 79–80. ropsychological test batteries for adults and children.
Ray, S., & Ulissi, S. M. (1982). Adaptation of the Wech- Indianapolis, IN: Author.
sler Preschool and Primary Scales of Intelligence for Reitan, R. M., & Davison, L. A. (Eds.). (1974). Clini-
deaf children. Natchitoches, LA: Steven Ray. cal neuropsychology: Current status and applications.
Razin, A. M. (1971). A-B variable in psychotherapy: A Washington, DC: Winston.
critical review. Psychological Bulletin, 75, 1–21. Reitan, R. M., & Wolfson, D. (1985). Halstead-Reitan
Reading, A. E., & Newton, J. R. (1978). A card sort Neuropsychological Test Battery: Theory and clin-
method of pain assessment. Journal of Psychosomatic ical interpretation. Tucson, AZ: Neuropsychology
Research, 22, 503–512. Press.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

600 References

Remmers, H. H., & Ewart, E. (1941). Reliability of Reynolds, C. R., & Kamphaus, R. W. (1997). The
multiple-choice measuring instruments as a func- Kaufman Assessment Battery for Children: Devel-
tion of the Spearman-Brown prophecy formula, III. opment, structure, and applications in neuropsy-
Journal of Educational Psychology, 32, 61–66. chology. In A. M. Horton, D. Wedding, et al. (Eds.),
Rentoul, A. J., & Fraser, B. J. (1979). Conceptualization The neuropsychology handbook: Vol. 1. Foundations
of enquiry-based or open classroom learning envi- and assessment (2nd ed., pp. 290–330). New York,
ronments. Journal of Curriculum Studies, 11, 233– NY: Springer-Verlag.
245. Reynolds, C. R., & Piersel, W. C. (1983). Multiple
Renzulli, J. S., Hartman, R. K., & Callahan, C. M. aspects of bias on the Boehm Test of Basic Con-
(1971). Teacher identification of superior students. cepts (Forms A and B) for white and for Mexican-
Exceptional Children, 38, 211–214, 243–248. American children. Journal of Psychoeducational
Renzulli, J. S., Smith, L. H., White, A. J., Callahan, Assessment, 1, 135–142.
C. M., & Hartman, R. K. (1976). Scales for rat- Reynolds, C. R., Willson, V. L., & Chatman, S. R.
ing the behavioral characteristics of superior students. (1984). Item bias on the 1981 revision of the Peabody
Wethersfield, CT: Creative Learning Press. Picture Vocabulary Test using a new method of
Reschly, D. J. (1978): WISC-R factor structures among detecting bias. Journal of Psychoeducational Assess-
Anglos, Blacks, Chicanos, and Native-American ment, 2, 219–224.
Papagos. Journal of Consulting and Clinical Psychol- Reynolds, C. R., Willson, V. L., & Chatman, S. R.
ogy, 46, 417–422. (1985). Regression analyses of bias on the Kaufman
Reschly, D. J., & Reschly, J. E. (1979). Brief reports Assessment Battery for Children. Journal of School
on the WISC-R: I. Validity of WISC-R factor scores Psychology, 23, 195–204.
in predicting achievement and attention for four Reynolds, W. M. (1979). A caution against the use of
socio-cultural groups. Journal of School Psychology, the Slosson Intelligence Test in the diagnosis of men-
17, 355–361. tal retardation. Psychology in the Schools, 16, 77–79:
Reschly, D. J., & Sabers, D. (1979). Analysis of test bias Reynolds, W. M. (1982). Development of reliable and
in four groups with the regression definition. Journal valid short forms of the Marlowe-Crowne Social
of Educational Measurement, 16, 1–9. Desirability Scale. Journal of Clinical Psychology, 38,
Resnick, D. (1982). History of educational testing. In 119–125.
A. K. Wigdor & W. R. Gamer (Eds.), Ability testing: Reynolds, W. M. (1985). Review of the Slosson Intelli-
Uses, consequences, and controversies: Part II. Docu- gence Test. In J. V. Mitchell, Jr. (Ed.), The ninth men-
mentation section (pp. 173–194). Washington, DC: tal measurements yearbook (Vol. 2, pp. 1403–1404).
National Academy Press. Lincoln, NE: University of Nebraska Press.
Resnick, P. J. (1988). Malingering of post-traumatic Rhodes, L., Bayley, N., & Yow, B. (1984). Supplement
disorders. In R. Rogers (Ed.), Clinical assessment of to the manual for the Bayley Scales of Infant Develop-
malingering and deception (pp. 84–103). New York: ment. San Antonio, TX: The Psychological Corpo-
Guilford Press. ration.
Reuter, J., Stancin, T., & Craig, P. (1981). Kent scoring Rich, C. C., & Anderson, R. P. (1965). A tactual form of
adaptation of the Bayley Scales of Infant Development. the Progressive Matrices for use with blind children.
Kent, OH: Kent Developmental Metrics. Personnel and Guidance Journal, 43, 912–919.
Reynell, J. (1969). Reynell Developmental Language Rich, J. (1968). Interviewing children and adolescents.
Scales. Windsor, England: NFER. London: Macmillan.
Reynolds, C. R. (1982). The problem of bias in psy- Richards, J. T. (1970). Internal consistency of the
chological assessment. In C. R. Reynolds & T. B. WPPSI with the mentally retarded. American Jour-
Gutkin (Eds.), The Handbook of School Psychology nal of Mental Deficiency, 74, 581–582.
(pp. 178–208). New York: Wiley. Richardson, A. (1977). Verbalizer-visualizer: A cogni-
Reynolds, C. R. (1985). Review of the System tive style dimension. Journal of Mental Imagery, 1,
of Multicultural Pluralistic Assessment. In J. V. 109–126.
Mitchell (Ed.), The ninth mental measurements year- Richardson, A. (1978). Subject, task, and tester
book (pp. 1519–1521). Lincoln, NE: University of variables associated with initial eye movement
Nebraska Press. responses. Journal of Mental Imagery, 2, 85–100.
Reynolds, C. R. (1989). Measurement and statistical Richardson, K. (1991). Understanding intelligence.
problems in neuropsychological assessment of chil- Philadelphia, PA: Milton Keynes.
dren. In C. R. Reynolds & E. Fletcher-Janzen (Eds.), Richman, N., Stevenson, J., & Graham, P. J. (1982).
Handbook of clinical child neuropsychology (pp. 147– Preschool to school: A behavioural study. New York:
166). New York: Plenum Press. Academic Press.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 601

Richmond, B. O., & Kicklighter, R. H. (1980). Chil- Robinson, J. P., Shaver, P. R., & Wrightsman, L. S.
dren’s Adaptive Behavior Scale. Atlanta, GA: Human- (1990). Measures of personality and social psycho-
ics. logical attitudes. San Diego, CA: Academic Press.
Ridgway, J., MacCulloch, M. J., & Mills, H. E. (1982). Roe, A., & Klos, D. (1969). Occupational classification.
Some experiences in administering a psychome- Counseling Psychologist, 1, 84–92.
tric test with a light pen and microcomputer. Roe, A., & Siegelman, M. (1964). The origin of interest.
International Journal of Man-Machine Studies, 17, Washington, DC: American Personnel and Guid-
265–278. ance Association.
Riethmiller, R. J., & Handler, L. (1997). The great fig- Rogers, R. (1984). Towards an empirical model of
ure drawing controversy: The integration of research malingering and deception. Behavioral Sciences and
and clinical practice. Journal of Personality Assess- the Law, 2, 93–111.
ment, 69, 488–496. Rogers, R., Bagby, M., & Gillis, R. (1992). Improve-
Rigazio-DiGilio, S. A. (1993). The Family System Test ments in the M test as a screening measure for malin-
(FAST): A spatial representation of family structure gering. Bulletin of the American Academy of Psychi-
and flexibility. The American Journal of Family Ther- atry and Law, 20, 101–104.
apy, 21, 369–375. Rogers, R., Harris, M., & Thatcher, A. A. (1983). Iden-
Ritzler, B. A., Sharkey, K. J., & Chudy, J. (1980). A com- tification of random responders on the MMPI: An
prehensive projective alternative to the TAT. Journal actuarial approach. Psychological Reports, 53, 1171–
of Personality Assessment, 44, 358–362. 1174.
Roach, A. J., Frazier, L. P., & Bowden, S. R. (1981). The Rogers, R., Ustad, K. L., Sewell, K. W., & Reinhardt,
Marital Satisfaction Scale: Development of a mea- V. (1996). Dimensions of incompetency: A factor
sure for intervention research. Journal of Marriage analytic study of the Georgia Court Competency
and the Family, 43, 537–545. Test. Behavioral Sciences and the Law, 14, 323–330.
Roberts, J. S., Laughlin, J. E., & Wedell, D. H. Roid, G. H. (1985). Computer-based test interpre-
(1999). Validity issues in the Likert and Thurstone tation: The potential of quantitative methods of test
approaches to attitude measurement. Educational interpretation. Computers in Human Behavior, 1,
and Psychological Measurement, 59, 211–233. 207–219.
Roberts, R. E., Vernon, S. W., & Rhoades, H. M. (1989). Rokeach, M. (1973). The nature of human values. New
Effects of language and ethnic status on reliabil- York: Free Press.
ity and validity of the Center for Epidemiologic Rome, H. P., Swenson, W. M., Mataya, P., McCarthy,
Studies – Depression Scale with psychiatric patients. C. E., Pearson, J. S., & Keating, R. F. (1962). Sym-
Journal of Nervous and Mental Disease, 177, 581–592. posium on automation techniques in personality
Robertson-Tchabo, E. A., & Arenberg, D. (1989). assessment. Proceedings of the Mayo Clinic, 37, 61–
Assessment of memory in older adults. In T. Hunt 82.
& C. J. Lindley (Eds.), Testing older adults (pp. 200– Romero, I. (1992). Individual assessment proce-
231). Austin, TX: Pro-Ed. dures with preschool children. In E. V. Nuttall, I.
Robertson, A., & Cochrane, R. (1973). The Wilson- Romero, & J. Kalesnik (Eds.), Assessing and screen-
Patterson Conservatism Scale: A reappraisal. British ing preschoolers (pp. 55–66). Boston: Allyn & Bacon.
Journal of Social and Clinical Psychology, 12, 428– Ronan, G. F., Colavito, V. A., & Hammontree, S. R.
430. (1993). Personal problem-solving system for scoring
Robinson, E. L., & Nagle, R. J. (1992). The comparabil- TAT responses: Preliminary validity and reliability
ity of the Test of Cognitive Skills with the Wechsler data. Journal of Personality Assessment, 61, 28–40.
Intelligence Scale for Children – Revised and the Roper, B. L., Ben-Porath, Y., & Butcher, J. N. (1995).
Stanford-Binet: Fourth edition with gifted children. Comparability and validity of computerized adap-
Psychology in the Schools, 29, 107–112. tive testing with the MMPI-2. Journal of Personality
Robinson, J. P., Athanasiou, R., & Head, K. B. (1969). Assessment, 65, 358–371.
Measures of occupational attitudes and occupational Rorer, L. G. (1965). The great response-style myth.
characteristics. Ann Arbor, MI: Institute for Social Psychological Bulletin, 63, 129–156.
Research. Rosecranz, H. A., & McNevin, T. E. (1969). A factor
Robinson, J. P., Rusk, J. G., & Head, K. B. (1968). Mea- analysis of attitudes toward the aged. The Gerontol-
sures of political attitudes. Ann Arbor, MI: Institute ogist, 9, 55–59.
for Social Research. Rosen, A. (1967). Limitations of personality invento-
Robinson, J. P., & Shaver, P. R. (1973). Measures of social ries for assessment of deaf children and adults as
psychological attitudes. Ann Arbor, MI: Institute for illustrated by research with the MMPI. Journal of
Social Research. Rehabilitation of the Deaf , 1, 47–52.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

602 References

Rosen, E. (1956). Self-appraisal, personal desirability Rounds, J. B. (1989). Review of the Career Assess-
and perceived social desirability of personality traits. ment Inventory. In J. C. Conoley & J. J. Kramer
Journal of Abnormal and Social Psychology, 52, 151– (Eds.), The tenth mental measurements yearbook
158. (pp. 139–141). Lincoln, NE: University of Nebraska
Rosen, R. C., & Beck, J. G. (1988). Patterns of sexual Press.
arousal. New York: Guilford Press. Rourke, B. P., & Fuerst, D. R. (1991). Learning disabil-
Rosen, W. G., Motts, R. C., & Davis, K. L. (1984). A ities and psychosocial functioning: A neuropsycholog-
new rating scale for Alzheimer’s disease. American ical perspective. New York: Guilford Press.
Journal of Psychiatry, 141, 1356–1364. Rowe, D. C., & Plomin, R. (1977). Temperament in
Rosenbaum, C. P., & Beebe, J. E. (1975). Psychi- early childhood. Journal of Personality Assessment,
atric treatment: Crisis, clinic, consultation. New York: 41, 150–156.
McGraw-Hill. Rubenstein, C., & Shaver, P. (1982). In search of inti-
Rosenberg, M. J., Hovland, C. I., McGuire, W. J., Abel- macy. New York: Delacorte Press.
son, R. P., & Brehm, J. W. (1960). Attitude organi- Rubenzer, S. (1992). A comparison of traditional and
zation and change. New Haven, CT: Yale University computer-generated psychological reports in an
Press. adolescent inpatient setting. Journal of Clinical Psy-
Rosenstock, I. M. (1974). Historical origins of the chology, 48, 817–826.
health belief model. Health Education Monographs, Ruebhausen, O. M., & Brim, O. G., Jr. (1966). Privacy
2, 238–335. and behavioral research. American Psychologist, 21,
Rosenthal, B. L., & Kamphaus, R. W. (1988). Inter- 423–437.
pretive tables for test scatter on die Stanford-Binet Ruebush, B. K. (1963). Anxiety. In H. W. Stevenson, J.
Intelligence Scale: Fourth edition. Journal of Psy- Kagan, & C. Spiker (Eds.), NSSE sixty-second year-
choeducational Assessment, 6, 359–370. book, Part I: Child psychology. Chicago, IL: University
Ross, C. E., & Minowsky, J. (1979). A comparison of of Chicago Press.
life-event weighting scheles: Change, undesirability, Ruesch, J., Loeb, M. B., & Jacobson, A. (1948). Accul-
and effect-proportional indices. Journal of Health turation and disease. Psychological Monographs:
and Social Behavior, 20, 166–177. General and Applied, 292, 1–40.
Ross, D. R. (1970). A technique of verbal assessment of Rulon, P. J. (1939). A simplified procedure for deter-
deaf students. Journal of Rehabilitation of the Deaf , mining the reliability of a test of split-halves. Har-
3, 7–15. vard Educational Review, 9, 99–103.
Ross, L. M., & Pollio, H. R. (1991). Metaphors of death: Runco, M. A., & Mraz, W. (1992). Scoring divergent
A thematic analysis of personal meanings. Omega, thinking tests using total ideational output and a
23, 291–307. creativity index. Educational and Psychological Mea-
Rossi, P. H., Wright, J. D., & Anderson, A. B. (Eds.). surement, 52, 213–221.
(1983). Handbook of survey research. New York: Aca- Ruschival, M. A., & Way, J. G. (1971). The WPPSI and
demic Press. the Stanford-Binet: A validity and reliability study
Rossini, E. D., & Kaspar, J. C. (1987). The validity of using gifted preschool children. Journal of Consult-
the Bender-Gestalt emotional indicators. Journal of ing and Clinical Psychology, 37, 163.
Personality Assessment, 51, 254–261. Russell, D., Peplau, L. A., & Cutrona, C. E. (1980).
Rossman, J. (1931). The psychology of the inventor. The Revised UCLA Loneliness Scale: Concurrent
Washington, DC: Inventors. and discriminant validity evidence. Journal of Per-
Rothbart, M. K. (1981). Measurement of temperament sonality and Social Psychology, 39, 472–480.
in infants. Child Development, 52, 569–578. Russell, D., Peplau, L. A., & Ferguson, M. L. (1978).
Rotter, J. B. (1946). The incomplete sentences test as a Developing a measure of loneliness. Journal of Per-
method of studying personality. American Psychol- sonality Assessment, 42, 290–294.
ogist, 1, 286. Rust, J. O., & Lose, B. D. (1980). Screening for gift-
Rotter, J. B. (1966). Generalized expectancies for inter- edness with the Slosson and the Scale for Rating
nal vs. external control of reinforcement. Psycholog- the Behavioral Characteristics of Superior Students.
ical Monographs, 80(Whole No. 609). Psychology in the Schools, 17, 446–451.
Rotter, J. B. (1967). A new scale for the measurement of Rutter, M. (1973). The assessment and treatment of
interpersonal trust. Journal of Personality, 35, 651– preschool autistic children. Early Child Development
665. and Care, 3, 13–29.
Rotter, J. B., & Rafferty, J. E. (1950). Manual: The Rotter Ryan, A. M., & Sackett, P. R. (1987a). A survey of indi-
Incomplete Sentences Blanks. New York: The Psycho- vidual assessment practices by I/O psychologists.
logical Corporation. Personnel Psychology, 40, 455–488.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 603

Ryan, A. M., & Sackett, P. R. (1987b). Pre-employment 1979–1987. Journal of the American Medical Associ-
honesty testing: Fakability, reactions of test takers, ation, 262, 365–369.
and company image. Journal of Business and Psy- Salthouse, T. A. (1986). Functional age: Examination
chology, 1, 248–256. of a concept. In J. E. Birren, P. K. Robinson, &
Ryan, J. J. (1981). Clinical utility of a WISC-R short J. E. Livingston (Eds.), Age, health and employment
form. Journal of Clinical Psychology, 37, 389–391. (pp. 78–92). Englewood Cliffs, NJ: Prentice Hall.
Ryan, R. M. (1987). Thematic Apperception Test. In Saltzman, J., Strauss, E., Hunter, M., & Spellacy, F.
D. J. Keyser & R. C. Sweetland (Eds.), Test critiques (1998). Validity of the Wonderlic Personnel Test as a
compendium (pp. 517–532). Kansas City, MO: Test brief measure of intelligence in individuals referred
Corporation of America. for evaluation of head injury. Archives of Clinical
Rybstein-Blinchik. E. (1979). Effects of different cog- Neuropsychology, 13, 611–616.
nitive strategies on chronic pain experience. Journal Samelson, F. (1977). World War I intelligence testing
of Behavioral Medicine, 2, 93–101. and the development of psychology. Journal of the
Ryman, D. H., Naitoh, P., Englund, C., & Genser, S. G. History of the Behavioral Sciences, 13, 274–282.
(1988). Computer response time measurements of Samuda, R. J. (1975). Psychological testing of American
mood, fatigue, and symptom scale items: Impli- minorities: Issues and consequences. New York: Dodd,
cations for scale response time uses. Computers in Mead, & Co.
Human Behavior, 4, 95–109. Sanchez, G. I. (1932). Scores of Spanish-speaking chil-
Sabatelli, R. M. (1988). Measurement issues in mar- dren on repeated tests. Journal of Genetic Psychology,
ital research: A review and critique of contempo- 40, 223–231.
rary survey instruments. Journal of Marriage and Sanchez, G. I. (1934). Blingualism and mental mea-
the Family, 50, 891–917. sures. Journal of Applied Psychology, 18, 765–772.
Saccuzzo, D. P., Higgins, G., & Lewandowski, D. Sanchez, R., & Atkinson, D. (1983). Mexican-
(1974). Program for psychological assessment of law American cultural commitment, preference for
enforcement officers: Initial evaluation. Psychologi- counselor ethnicity, and willingness to use counsel-
cal Reports, 35, 651–654. ing. Journal of Counseling Psychology, 30, 215–220.
Saccuzzo, D. P., & Lewandowski, D. G. (1976). The Sandberg, S. T., Wieselberg, M., & Shaffer, D. (1980).
WISC as a diagnostic tool. Journal of Clinical Psy- Hyperkinetic and conduct problem children in a
chology, 32, 115–124. primary school population: Some epidemiological
Sachs, B., Trybus, R., Koch, H., & Falberg, R. (1974). considerations. Journal of Child Psychology and Psy-
Current developments in the psychological evalua- chiatry and Allied Disciplines, 21, 293–311.
tion of deaf individuals. Journal of Rehabilitation of Sandler, I. S., & Guenther, R. T. (1985). Assessment of
the Deaf , 8, 136–140. life stress events. In P. Karoly (Ed.), Measurement
Sackett, P. R. (1987). Assessment centers and content strategies in health psychology (pp. 555–600). New
validity: Some neglected issues. Personnel Psychol- York: Wiley.
ogy, 40, 13–25. Sandoval, J. (1979). The WISC-R and internal evidence
Sackett, P. R., Burris, L. R., & Callahan, C. (1989). of test bias with minority groups. Journal of Consult-
Integrity testing for personnel selection: An update. ing and Clinical Psychology, 47, 919–927.
Personnel Psychology, 42, 491–529. Sandoval, J. (1981). Format effects in two Teacher
Sackett, P. R., & Dreher, G. F. (1982). Constructs Rating Scales of Hyperactivity. Journal of Abnormal
and assessment center dimensions: Some troubling Child Psychology, 9, 203–218.
empirical findings. Journal of Applied Psychology, 67, Sandoval, J. (1985). Review of the System of Multicul-
401–410. tural Pluralistic Assessment. In J. V. Mitchell (Ed.),
Sackett, P. R., & Harris, M. M. (1984). Honesty test- The ninth mental measurements yearbook (pp. 1521–
ing for personnel selection: A review and critique. 1525). Lincoln, NE: University of Nebraska Press.
Personnel Psychology, 37, 221–245. Sanford, E. C. (1987). Biography of Granville Stanley
Sackheim, H. A., & Gur, R. C. (1979). Self-deception, Hall. The American Journal of Psychology, 100, 365–
other-deception, and self-reported psychopathol- 375.
ogy. Journal of Consulting and Clinical Psychology, Santor, D. A., & Coyne, J. C. (1997). Shortening the
47, 213–215. CES-D to improve its ability to detect cases of
Saks, M. J. (1976). The limits of scientific jury selection: depression. Psychological Assessment, 9, 233–243.
Ethical and empirical. Jurimetrics Journal, 17, Sapinkopf, R. C. (1978). A computer adaptive testing
3–22. approach to the measurement of personality vari-
Salive, M. E., Smith, G. S., & Brewer, T. F. (1989). Sui- ables. Dissertation Abstracts International, 38, 1OB,
cide mortality in the Maryland state prison system, 4993.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

604 References

Sapp, S. G., & Harrod, W. J. (1993). Reliability and Sawyer, J. (1966). Measurement and prediction, clin-
validity of a brief version of Levenson’s Locus of ical and statistical. Psychological Bulletin, 66, 178–
Control Scale. Psychological Reports, 72, 539–550. 200.
Sarason, I. G. (1958). Interrelationships among indi- Saxe, S. J., & Reiser, M. (1976). A comparison of
vidual difference variables, behavior in psychother- three police applicant groups using the MMPI. Jour-
apy, and verbal conditioning. Journal of Abnormal nal of Police Science and Administration, 4, 419–
and Social Psychology, 56, 339–344. 425.
Sarason, I. G. (1960). Empirical findings and theoret- Saylor, C. F., Finch, A. J. Jr., Furey, W., Baskin, C. H., &
ical problems in the use of anxiety scales. Psycholog- Kelly, M. M. (1984). Construct validity for measures
ical Bulletin, 57, 403–415. of childhood depression: Application of multi-trait-
Sarason, I. G. (1980) (Ed.). Test anxiety. Hillsdale, NJ: multimethod methodology. Journal of Consulting
Erlbaum. and Clinical Psychology, 52, 977–985.
Sarason, I. G., Johnson, J. H., & Siegel, J. M. (1979). Schaefer, C. E. (1967). Biographical correlates of sci-
Assessing the impact of life changes. In I. G. Sarason entific and artistic creativity in adolescents. Unpub-
& C. D. Spielberger (Eds.), Stress and anxiety lished doctoral dissertation, Fordham University,
(Vol. 6). New York: Wiley. New York.
Sarason, S. B., Davidson, K. S., Lighthall, F. F., Waite, Schaefer, C. E., & Anastasi, A. (1968). A biographi-
R. R., & Ruebush, B. K. (1960). Anxiety in elementary cal inventory for identifying creativity in adolescent
school children. New York: Wiley. boys. Journal of Applied Psychology, 52, 42–48.
Sarason, S. B., & Mandler, G. (1952). Some correlates Schakel, J. A. (1986). Cognitive assessment of
of test anxiety. Journal of Abnormal and Social Psy- preschool children. School Psychology Review, 15,
chology, 47, 810–817. 200–215.
Sarbin, T. R. (1943). A contribution to the study Scherer, P. (1983). Psychoeducational evaluation of
of actuarial and individual methods of prediction. hearing-impaired preschool children. American
American Journal of Sociology, 48, 593–602. Annals of the Deaf, 128, 118–124.
Sattler, J. M. (1974). Assessment of children’s Schetz, K. (1985). Comparison of the Compton
intelligence. Philadelphia, PA: W. B. Saunders. Speech and Language Screening Evaluation with the
Sattler, J. M. (1982). Assessment of children’s intelligence Fluharty Preschool Speech and Language Screen-
and special abilities (2nd ed.). Boston, MA: Allen & ing Test. Journal of the American Speech-Language-
Bacon. Hearing Association, 16, 16–24.
Sattler, J. M. (1988). Assessment of children (3rd ed.). Schiff, M. (1977). Hazard adjustment, locus of control,
San Diego, CA: Author. and sensation seeking: Some null findings. Environ-
Sattler, J. M., & Gwynne, J. (1982).White examiners ment and Behavior, 9, 233–254.
generally do not impede the intelligence test perfor- Schippmann, J. S., Prien, E. P., & Katz, J. A. (1990). Reli-
mance of black children: To debunk a myth. Journal ability and validity of in-basket performance mea-
of Consulting and Clinical Psychology, 50, 196–208. sures. Personnel Psychology, 43,837–859.
Sattler, J. M., & Theye, F. (1967). Procedural, situ- Schirmer, B. R. (1993). Constructing meaning from
ational, and interpersonal variables in individual narrative text. American Annals of the Deaf, 138, 397–
intelligence testing. Psychological Bulletin, 68, 347– 403.
360. Schmid, K. D., Rosenthal, S. L., & Brown, E. D. (1988).
Sattler, J. M., & Tozier, L. L. (1970). A review of intelli- A comparison of self-report measures of two family
gence test modifications used with cerebral palsied dimensions: Control and cohesion. The American
and other handicapped groups. The Journal of Spe- Journal of Family Therapy, 16, 73–77.
cial Education, 4, 391–398. Schmidt, F. L. (1991). Why all banding procedures in
Sauer, W. J., & Warland, R. (1982). Morale and life sat- personnel selection are logically flawed. Human Per-
isfaction. In D. J. Mangen & W. A. Peterson (Eds.), formance, 4, 265–278.
Research instruments in social gerontology (Vol. 1, Schmidt, F. L. (1985). Review of Wonderlic Person-
pp. 195–240). Minneapolis, MN: University of Min- nel Test. In J. V. Mitchell, Jr. (Ed.), The ninth
nesota Press. mental measurements yearbook (Vol. II, pp. 1755–
Sawicki, R. F., Leark, R., Golden. C. J., & Karras, D. 1757). Lincoln, NE: University of Nebraska
(1984). The development of the pathognomonic, Press.
left sensorimotor and right sensorimotor scales for Schmidt, F. L., & Hunter, J. E. (1974). Racial and ethnic
the Luria-Nebraska Neuropsychological Battery – bias in psychological tests: Divergent implications of
Children’s Revision. Journal of Clinical Child Psy- two definitions of test bias. American Psychologist,
chology, 13, 165–169. 29, 1–8.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 605

Schmidt, F. L., & Hunter, J. E. (1977). Development of Schoenfeldt, L. F. (1989). Guidelines for computer-
a general solution to the problem of validity gener- based psychological tests and interpretations. Com-
alization. Journal of Applied Psychology, 62, 529–540. puters in Human Behavior, 5, 13–21.
Schmidt, F. L., & Hunter, J. E. (1980). The future of Scholl, G., & Schnur, R. (1976). Measures of psycholog-
criterion-related validity. Personnel Psychology, 33, ical, vocational, and educational functioning in the
41–60. blind and visually handicapped. New York: Ameri-
Schmidt, F. L., & Hunter, J. E. (1981). Employment can Foundation for the Blind.
testing. Old theories and new research findings. Schopler, E., & Reichler, R. J. (1979). Individualized
American Psychologist, 36, 1128–1137. assessment and treatment for autistic and develop-
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, mentally disabled children: Vol. I. Psychoeducational
G. S. (1979). Further tests of the Schmidt-Hunter profile. Baltimore, MD: University Park Press.
Bayesian validity generalization procedure. Person- Schrader, A. D., & Osburn, H. G. (1977). Biodata
nel Psychology, 32, 257–281. faking: Effects of induced subtlety and position
Schmidt, F. L., Urry, V. W., & Gugel, J. F. (1978). Com- specificity. Personnel Psychology, 30, 395–404.
puter assisted tailored testing: Examinee reactions Schrader, W. B. (1971). The predictive validity of Col-
and evaluations. Educational and Psychological Mea- lege Board admissions tests. In W. H. Angoff (Ed.),
surement, 38, 265–273. The College Board admissions testing program (pp.
Schmidt, N., & Sermat, V. (1983). Measuring loneliness 117–145). New York: College Entrance Examination
in different relationships. Journal of Personality and Board.
Social Psychology, 44, 1038–1047. Schretlen, D. J. (1988). The use of psychological tests to
Schmit, M. J., & Ryan, A. M. (1993). The Big Five in identify malingered symptoms of mental disorder.
personnel selection: Factor structure in applicant Clinical Psychology Review, 8, 451–476.
and nonapplicant populations. Journal of Applied Schretlen, D., & Arkowitz, H. (1990). A psychologi-
Psychology, 78, 966–974. cal test battery to detect prison inmates who fake
Schmitt, F. A., & Ranseen, J. D. (1989). Neuropsy- insanity or mental retardation. Behavioral Sciences
chological assessment of older adults. In T. Hunt & and the Law, 8, 75–84.
C. J. Lindley (Eds.), Testing older adults (pp. 51–69). Schretlen, D., Wilkins, S. S., Van Gorp, W. G., & Bob-
Austin, TX: Pro-Ed. holz, J. H. (1992). Cross-validation of a psychologi-
Schmitt, N., Gilliland, S. W., Landis, R. S., & Devine, D. cal test battery to detect faked insanity. Psychological
(1993). Computer-based testing applied to selection Assessment, 4, 77–83.
of secretarial applicants. Personnel Psychology, 46, Schriesheim, C. A. (1981). The effect of grouping or
149–165. randomizing items on leniency response bias. Edu-
Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. cational and Psychological Measurement, 41, 401–
(1984). Metaanalysis of validity studies published 411.
between 1964 and 1982 and the investigation of Schriesheim, C. A., & DeNisi, A. A. (1980). Item pre-
study characteristics. Personnel Psychology, 37, 407– sentation as an influence on questionnaire validity:
422. A field experiment. Educational and Psychological
Schmitt, N., & Robertson, I. (1990). Personnel selec- Measurement, 40, 175–182.
tion. Annual Review of Psychology, 41, 289–319. Schriesheim, C. A., Eisenbach, R. J., & Hill, K. D.
Schneider, B. (1983). An interactionist perspective on (1991). The effect of negation and polar oppo-
organizational effectiveness. In K. S. Cameron & site item reversals on questionnaire reliability and
D. A. Whetten (Eds.), Organizational effectiveness: A validity: An experimental investigation. Educational
comparison of multiple models (pp. 27–54). Orlando, and Psychological Measurement, 51, 67–78.
FL: Academic Press. Schriesheim, C. A., & Klich, N. R. (1991). Fiedler’s
Schneider, W. H. (1992). After Binet: French intelli- Least Preferred Coworker (LPC) Instrument: An
gence testing, 1900–1950. Journal of the History of investigation of its true bipolarity. Educational and
the Behavioral Sciences, 28, 111–132. Psychological Measurement, 51, 305–315.
Schneider, J. M., & Parsons, O. A. (1970). Categories on Schroeder, L. D., Sjoquist, D. L., & Stephan, P. E. (1986).
the locus of control scale and cross-cultural comm- Understanding regression analysis. Newbury Park,
parisons in Denmark and the United States. Journal CA: Sage.
of Cross-Cultural Psychology, 1, 131–138. Schuessler, K. F. (1961). A note on statistical signifi-
Schoenfeldt, L. F. (1985). Review of Wonderlic Person- cance of scalogram. Sociometry, 24, 312–318.
nel Test. In J. V. Mitchell, Jr. (Ed.), The ninth men- Schuh, A. J. (1967). The predictability of employee
tal measurements yearbook (Vol. II, pp. 1757–1758). tenure: A review of the literature. Personnel Psychol-
Lincoln, NE: University of Nebraska Press. ogy, 20, 133–152.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

606 References

Schuldberg, D. (1988). The MMPI is less sensitive to Serwer, B. J., Shapiro, B. J., & Shapiro, P. P. (1972).
the automated testing format than it is to repeated Achievement prediction of “high risk” children. Per-
testing: Item and scale effects. Computers in Human ceptual and Motor Skills, 35, 347–354.
Behavior, 4, 285–298. Sewell, T. E. (1977). A comparison of WPPSI and
Schuman, H., & Kalton, G. (1985). Survey methods. In Stanford-Binet Intelligence Scale (1972) among
G. Lindzey & E. Aronson (Eds.), Handbook of social lower SES black children. Psychology in the Schools,
psychology (Vol. 1, 3rd ed.). New York: Random 14, 158–161.
House. Shah, C. P., & Boyden, M. F. H. (1991). Assessment
Schutte, N. S., & Malouff, J. M. (1995). Sourcebook of of auditory functioning. In B. A. Bracken (Ed.),
adult assessment strategies. New York: Plenum Press. The psychoeducational assessment of preschool chil-
Schwab, D. P., & Oliver, R. L. (1974). Predicting tenure dren (2nd ed., pp. 341–378). Boston, MA: Allyn &
with biographical data: Exhuming buried evidence. Bacon.
Personnel Psychology, 27, 125–128. Shapiro, E. S. (1987). Behavioral assessment in school
Schwab, J. J., Bialow, M., Brown, J. M., & Holzer, C. E. psychology. Hillsdale, NJ: Erlbaum.
(1967). Diagnosing depression in medical inpa- Sharp, S. E. (1898–1899). Individual psychology: A
tients. Annals of Internal Medicine, 67, 695–707. study in psychological method. American Journal
Scissons, E. H. (1976). Computer administration of of Psychology, 10, 329–391.
the California Psychological Inventory. Measure- Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989).
ment and Evaluation in Guidance, 9, 24–30. Generalizability theory. American Psychologist, 44,
Scogin, F., Schumacher, J., Gardner, J., & Chaplin, W. 922–932.
(1995). Predictive validity of psychological testing Shaycoft, M. F. (1979). Handbook of criterion-
in law enforcement settings. Professional Psychology, referenced testing: Development, evaluation, and use.
26, 68–71. New York: Garland STPM Press.
Scott, R. D., & Johnson, R. W. (1967). Use of the Sheehan, K. R., & Gray, M. R. (1991). Sex bias in the
weighted application blank in selecting unskilled SAT and the DTMS. The Journal of General Psychol-
employees. Journal of Applied Psychology, 51, 393– ogy, 119, 5–14.
395. Shepard, L. (1980). Standard setting issues and meth-
Scully, J. A., Tosi, H., & Banning, K. (2000). Life event ods. Applied Psychological Measurement, 4, 447–467.
checklists: Revisiting the Social Readjustment Rat- Sherer, M., Parsons, O. A., Nixon, S., & Adams, R. L.
ing Scale after 30 years. Educational and Psychologi- (1991). Clinical validity of the Speech Sounds Per-
cal Measurement, 60, 864–876. ception Test and the Seashore Rhythm Test. Journal
Seamons, D. T., Howell, R. J., Carlisle, A. L., & Roe, of Clinical and Experimental Neuropsychology, 13,
A. V. (1981). Rorschach simulation of mental illness 741–751.
and normality by psychotic and non-psychotic legal Sherman, S. W., & Robinson, N. M. (Eds.). (1982).
offenders. Journal of Personality Assessment, 45, 130– Ability testing of handicapped people: Dilemma for
135. government, science, and the public. Washington,
Seashore, H. G., Wesman, A. G., & Doppelt, J. E. (1950). DC: National Academy Press.
The standardization of the Wechsler Intelligence Sherry, D. L., & Piotrowski, C. (1986). Consistency
Scale for Children. Journal of Consulting Psychology, of factor structure on the Semantic Differential: An
14, 99–110. analysis of three adult samples. Educational and Psy-
Segal, D. L., Hersen, M., & Van Hasselt, V. B. (1994). chological Measurement, 46, 263–268.
Reliability of the Structured Clinical Interview for Shimberg, B. (1981). Testing for licensure and certifi-
DSM-III-R: An evaluative review. Comprehensive cation. American Psychologist, 36, 1138–1146.
Psychiatry, 35, 316–327. Shneidman, E. S. (Ed.). (1951), Thematic test analysis.
Segal, D. L., Hersen, M., Van Hasselt, V. B., Kabacoff, New York: Grune & Stratton.
R. I., & Roth, L. (1993). Reliability of diagnosis Shore, T. H., Thornton, G. C., III, & Shore, L. M.
in older psychiatric patients using the Struc- (1990). Construct validity of two categories of
tured Clinical Interview for DSM-III-R. Journal assessment center dimension ratings. Personnel Psy-
of Psychopathology and Behavioral Assessment, 15, chology, 43, 101–116.
347–356. Shostrom, E. L. (1963). Personal Orientation Inventory.
Seligman, M. (1975). Helplessness. New York: Freeman. San Diego, CA: Education and Industrial Testing
Selzer, M. L. (1971). The Michigan Alcoholism Screen- Service.
ing Test: The quest for a new diagnostic instru- Shostrom, E. L. (1964). A test for the measurement
ment. American Journal of Psychiatry, 127, 1653– of self-actualization. Educational and Psychological
1658. Measurement, 24, 207–218.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 607

Shrauger, J. S., & Osberg, T. M. (1981). The relative ness of fit in early intervention. Infant Mental Health
accuracy of self-prediction and judgments by others Journal, 7, 81–94.
in psychological assessment. Psychological Bulletin, Simeonsson, R. J., Buckley, L., & Monson, L. (1979).
90, 322–351. Concepts of illness causality in hospitalized chil-
Shrout, P. E., Spitzer, R. L., & Fleiss, J. L. (1987). dren. Journal of Pediatric Psychology, 4, 77–84.
Quantification of agreement in psychiatric diagno- Simeonsson, R. J., Huntington, G. S., & Parse, S. A.
sis revisited. Archives of General Psychiatry, 44, 172– (1980). Expanding the developmental assessment
177. of young handicapped children. New Directions for
Shultz, K. S., & Chavez, D. V. (1994). The reliability Exceptional Children, 3, 51–74.
and factor structure of a social desirability scale in Simola, S. K., & Holden, R. R. (1992). Equiva-
English and in Spanish. Educational and Psycholog- lence of computerized and standard administra-
ical Measurement, 54, 935–940. tion of the Piers-Harris Children’s Self-Concept
Sibley, S. (1989). Review of the WISC-Riter ‘Complete’. Scale. Journal of Personality Assessment, 58, 287–
In J. C. Conoley & J. J. Kramer (Eds.), The tenth men- 294.
tal measurements yearbook (pp. 892–893). Lincoln, Singer, E., & Presser, S. (1989). Survey research meth-
NE: University of Nebraska Press. ods. Chicago, IL: University of Chicago Press.
Sicoly, F. (1992). Estimating the accuracy of decisions Sipos, K., Sipos, M., & Spielberger, C. D. (1985).
based on cutting scores. Journal of Psychoeducational The development and validation of the Hungar-
Assessment, 10, 26–36. ian form of the Test Anxiety Inventory. In H. M.
Sidick, J. T., Barrett, G. V., & Doverspike, D. (1994). Van Der Ploeg, R. Schwarzer, & C. D. Spielberger
Three-alternative multiple choice tests: An attractive (Eds.), Advances in Test Anxiety Research (Vol.
option. Personnel Psychology, 47, 829–835. 4, pp. 221–228). Lisse, the Netherlands: Swets &
Siebel, C. C., Faust, W. L., & Faust, M. S. (1971). Zeitlinger.
Administration of design copying tests to large Siu, A. L., Reuben, D. B., & Hayes, R. D. (1990). Hierar-
groups of children. Perceptual and Motor Skills, 32, chical measures of physical function in ambulatory
355–360. geriatrics. Journal of the American Geriatric Society,
Siegel, D. J., & Piotrowski, R. J. (1985). Reliability of 38, 1113–1119.
K-ABC subtest composites. Journal of Psychoeduca- Skinner, H. A., & Allen, B. A. (1983). Does the com-
tional Assessment, 3, 73–76. puter make a difference? Computerized vs. face-to-
Silverman, B. I., Bishop, G. F., & Jaffe, J. (1976). Psy- face vs. self-report assessment of alcohol, drug, and
chology of the scientist: XXXV. Terminal and instru- tobacco use. Journal of Consulting and Clinical Psy-
mental values of American graduate students in Psy- chology, 51, 267–275.
chology. Psychological Reports, 39, 1099–1108. Skinner, H. A., & Lei, H. (1980). Differential weights
Silverstein, A. B. (1967). Validity of WISC short forms in life change research: Useful or irrelevant? Psycho-
at three age levels. California Mental Health Research somatic Medicine, 42, 367–370.
Digest, 5, 253–254. Slate, J. R., & Hunnicutt, L. C., Jr. (1988). Examiner
Silverstein, A. B. (1968). Validity of a new approach errors on the Wechsler scales. Journal of Psychoedu-
to the design of WAIS, WISC, and WPPSI short cational Assessment, 6, 280–288.
forms. Journal of Consulting and Clinical Psychology, Slomka, G. T., & Tarter, R. E. (1984). Mental retarda-
32, 478–479. tion. In R. E. Tarter & G. Goldstein (Eds.), Advances
Silverstein, A. B. (1970). Reappraisal of the validity of in clinical neuropsychology (Vol. 2, pp. 109–137).
WAIS, WISC, and WPPSI short forms. Journal of NY: Plenum Press.
Consulting and Clinical Psychology, 34, 12–14. Slosson, R. L. (1963). Slosson Intelligence Test (SIT) for
Silverstein, A. B. (1971). Deviation social quotients children and adults. New York: Slosson Educational.
for the Vineland Social Maturity Scale. American Slosson, R. L. (1991). Slosson Intelligence Test (SIT-R).
Journal of Mental Deficiency, 76, 348–351. East Aurora, NY: Slosson Educational.
Silverstein, A. B. (1973). Factor structure of the Wech- Small, S. A., Zeldin, R. S., & Savin-Williams, R. C.
sler Intelligence Scale for Children for three ethnic (1983). In search of personality traits: A multi-
groups. Journal of Educational Psychology, 65, 408– method analysis of naturally occurring prosocial
410. and dominance behavior. Journal of Personality, 51,
Silverstein, A. B. (1984). Pattern analysis: The question 1–16.
of abnormality. Journal of Consulting and Clinical Smith, A. L., Hays, J. R., & Solway, K. S. (1977). Com-
Psychology, 52, 936–939. parison of the WISC-R and Culture Fair Intelligence
Simeonsson, R. J., Bailey, D. B., Huntington, G. S., & Test in a juvenile delinquent population. The Journal
Comfort, M. (1986). Testing the concept of good- of Psychology, 97, 179–182.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

608 References

Smith, D. K., St. Martin, M. E., & Lyon, M. A. (1989). A “Barnum effect” and beyond. Journal of Consulting
validity study of the Stanford-Binet: Fourth edition and Clinical Psychology, 45, 104–114.
with students with learning disabilities. Journal of Snyder, D. K. (1979). Multidimensional assessment
Learning Disabilities, 22, 260–261. of marital satisfaction. Journal of Marriage and the
Smith, J. D. (1985). Minds made feeble: The myth and Family, 41, 813–823.
legacy of the Kallikaks. Rockville, MD: Aspen Sys- Snyder, D. K. (1981). Manual for the Marital Satisfac-
tems. tion Inventory. Los Angeles, CA: Western Psycholog-
Smith, M. B. (1990). Henry A. Murray (1893-1988): ical Services.
Humanistic psychologist. Journal of Humanistic Psy- Snyder, D. K., Lachar, D., & Wills, R. M. (1988).
chology, 30, 6–13. Computer-based interpretation of the Marital Satis-
Smith, M. B., & Anderson, J. W. (1989). Obituary: faction Inventory: Use in treatment planning. Jour-
Henry A. Murray (1893-1988). American Psychol- nal of Marital and Family Therapy, 14, 397–409.
ogist 44, 1153–1154. Snyder, D. K., Widiger, T. A., & Hoover, D. W.
Smith, P. C., & Kendall, L. M. (1963). Retranslation (1990). Methodological considerations in validat-
of expectations: An approach to the construction of ing computer-based test interpretations: Control-
unambiguous anchors for rating scales. Journal of ling for response bias. Psychological Assessment, 2,
Applied Psychology, 47, 149–155. 470–477.
Smither, R. D., & Houston, J. M. (1992). The nature of Soares, A. T., & Soares, L. M. (1975). Self-Perception
competitiveness: the development and validation of Inventory Composite manual. Bridgeport, CT: Uni-
the competitiveness index. Educational and Psycho- versity of Bridgeport.
logical Measurement, 52, 407–418. Sokal, M. M. (Ed.). (1987). Psychological testing and
Smyth, F. L. (1989). Commercial coaching and American society 1890–1930. New Brunswick, NJ:
SAT scores. Journal of College Admissions, 123, Rutgers University Press.
2–9. Sokal, M. M. (1990). G. Stanley Hall and the institu-
Smyth, F. L. (1990). SAT coaching. Journal of College tional character of psychology at Clark 1889–1920.
Admissions, 129, 7–17. Journal of the History of the Behavioral Sciences, 26,
Snell, W. E., Jr. (1989). Development and validation of 114–124.
the Masculine Behavior Scale: A measure of behav- Sokal, M. M. (1991). Obituary: Psyche Cattell (1893–
iors stereotypically attributed to males vs. females. 1989). American Psychologist, 46, 72.
Sex Roles, 21, 749–767. Solomon, E., & Kopelman, R. E. (1984). Questionnaire
Snell, W. E., & Papini, D. R. (1989). The sexuality scale: format and scale reliability: An examination of three
An instrument to measure sexual-esteem, sexual- modes of item presentation. Psychological Reports,
depression, and sexual-preoccupation. The Journal 54, 447–452.
of Sex Research, 26, 256–263. Sonnenfeld, J. (1969). Personality and behavior in
Snider, J. G., & Osgood, C. E. (Eds.). (1969). Seman- environment. Proceedings of the Association of Amer-
tic Differential technique: A sourcebook. Chicago, IL: ican Geographers, 1, 136–140.
Aldine. Space, L. G. (1981). The computer as a psychometri-
Snow, J. H., & Hynd, G. W. (1985a). A multi- cian. Behavior Research Methods and Instrumenta-
variate investigation of the Luria-Nebraska Neu- tion, 13, 596–606.
ropsychological Battery – Children’s Revision with Spanier, G. B. (1976). Measuring dyadic adjustment:
learning-disabled children. Journal of Psychoeduca- New scales for assessing the quality of marriage and
tional Assessment, 3, 101–109. similar dyads. Journal of Marriage and the Family,
Snow, J. H., & Hynd, G. W. (1985b). Factor structure 38, 15–28.
of the Luria-Nebraska Neuropsychological Battery – Sparrow, S. S., Balla, D. A., & Cicchetti, D. V. (1984).
Children’s Revision. Journal of School Psychology, 23, Vineland Adaptive Behavior Scales. Circle Pines, MN:
271–276. American Guidance Service.
Snyder, C. R., Harris, C., Anderson, J. R., Holleran, Spearman, C. (1904). “General intelligence” objec-
S. A., Irving, L. M., Sigmon, S. T., Yoshinobu, tively determined and measured. American Journal
L., Gibb, J., Langelle, C., & Harney, P. (1991). of Psychology, 15, 201–293.
The Will and the Ways: Development and valida- Spearman, C. (1927). The abilities of man. London:
tion of an individual-differences measure of hope. Macmillan.
Journal of Personality and Social Psychology, 60, Spearman, C. (1930). Autobiography. In C. Murchison
570–585. (Ed.), A history of psychology in autobiography
Snyder, C. R., Shenkel, R. J., & Lowery, C. R. (1977). (Vol. 1, pp. 299–334). Worcester, MA: Clark Uni-
Acceptance of personality interpretations: The versity Press.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 609

Spector, P. (1982). Behavior in organizations as a func- Steer, R. A., Beck, A. T., & Garrison, B. (1986). Appli-
tion of employees’ locus of control. Psychological cations of the Beck Depression Inventory. In N. Sar-
Bulletin, 91, 482–497. torius & T. A. Ban (Eds.), Assessment of depression
Spector, P. E. (1988). Development of the Work Locus (pp. 123–142). New York: Springer-Verlag.
of Control Scale. Journal of Occupational Psychology, Steer, R. A., Rissmiller, D. J., Ranieri, W. F., & Beck,
61, 335–340. A. T. (1994). Use of the computer-administered Beck
Spielberger, C. D. (1966) (Ed.), Anxiety and behavior. Depression Inventory and Hopelessness Scale with
New York: Academic Press. psychiatric inpatients. Computers in Human Behav-
Spielberger, C. D. (1980). Test Anxiety Inventory. Palo ior, 10, 223–229.
Alto, CA: Consulting Psychologists Press. Stein, K. B. (1968). The TSC scales: The outcome
Spielberger, C. D., Gorsuch, R. L., & Lushene, R. E. of a cluster analysis of the 550 MMPI items. In P.
(1970). STAI Manual. Palo Alto, CA: Consulting McReynolds (Ed.), Advances in psychological assess-
Psychologists Press. ment (Vol. 1, pp. 80–104). Palo Alto, CA: Science
Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, and Behavior Books.
P. R., & Jacobs, G. A. (1983). Manual for the State- Stein, L. A. R., & Graham, J. R. (1999). Detecting fake-
Trait Anxiety Inventory: A “self-evaluation ques- good MMPI-A profiles in a correctional facility. Psy-
tionnaire.” Palo Alto, CA: Consulting Psychologists chological Assessment, 11, 386–395.
Press. Stein, M. I. (1981). Thematic Apperception Test (2nd
Spitzer, R. L., & Endicott, J. (1977). Schedule for Affec- ed.). Springfield, IL: Charles C Thomas.
tive Disorders and Schizophrenia – Life-time version Stein, S. P., & Charles, P. (1971). Emotional factors
(SADS-L). New York: New York State Psychiatric in juvenile diabetes mellitus: A study of early life
Institute. experience of adolescent diabetics. American Journal
Spitzer, R. L., & Williams, J. B. W. (1983). Instruc- of Psychiatry, 128, 56–60.
tion manual for the Structured Clinical Interview for Stephens, T. E., & Lattimore, J. (1983). Prescriptive
DSM-III (SCID). New York: Biometrics Research checklist for positioning multihandicapped residen-
Department, New York State Psychiatric Institute. tial clients: A clinical report. Physical Therapy, 63,
Spitzer, R. L., Williams, J. B. W., Gibbon, M., & First, 1113–1115.
M. B. (1992). The Structured Clinical Interview for Stephenson, W. (1953). The study of behavior. Chicago,
DSM-III-R (SCID). Archives of General Psychiatry, IL: University of Chicago Press.
49, 624–629. Stern, W. (1930). Autobiography. In C. Murchison
Spranger, E. (1928). Types of men. Translated from the (Ed.), A history of psychology in autobiography
5th German edition of Lebensformen by P. J. W. Pig- (Vol. 1, pp. 335–388). Worcester, MA: Clark Uni-
ors. Halle: Max Niemeyer Verlag. versity Press.
Spruill, J. (1988). Two types of tables for use with the Sternberg, R. J. (1984). The Kaufman Assessment Bat-
Stanford-Binet Intelligence Scale: Fourth edition. tery for Children: An information processing anal-
Journal of Psychoeducational Assessment, 6, 78–86. ysis and critique. The Journal of Special Education,
Spruill, J. (1991). A comparison of the Wechsler Adult 18, 269–279.
Intelligence Scale – Revised with the Stanford-Binet Sternberg, R. J. (1985). Beyond IQ: A triarchic theory
Intelligence Scale (4th ed.) for mentally retarded of human intelligence. Cambridge, MA: Cambridge
adults. Psychological Assessment, 3, 133–135. University Press.
Staats, S. (1989). Hope: A comparison of two self- Sternberg, R. J. (Ed.). (1988a). Advances in the psychol-
report measures for adults. Journal of Personality ogy of human intelligence (Vols. 1–4). Hillsdale, NJ:
Assessment, 53, 366–375. Erlbaum.
Staats, S. R., & Stassen, M. A. (1985). Hope: An affective Sternberg, R. J. (1988b). The triarchic mind. New York:
cognition. Social Indicators Research, 17, 235–242. Viking.
Staff (1988). Validity service update. GRE Board Sternberg, R. J. (1990). Metaphors of mind: Concep-
Newsletter, p. 3. tions of the nature of intelligence. Cambridge, MA:
Stanley, J. C., & Porter, A. C. (1967). Correlation of Cambridge University Press.
Scholastic Aptitude Test scores with college grades Sternberg, R. J., & Davidson, J. E. (1986). Conceptions
for Negroes vs. whites. Journal of Educational Mea- of giftedness. New York: Cambridge University.
surement, 4, 199–218. Sternberg, R. J., & Detterman, D. K. (1986). What is
Stanton, J. M. (1956). Group personality profiles intelligence? Norwood, NJ: Ablex.
related to aspects of antisocial behavior. Journal of Sternberg, R. J., & Grigorenko, E. L. (1997). Are cog-
Criminal Law, Criminology and Police Science, 47, nitive styles still in style? American Psychologist, 52,
340–349. 700–712.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

610 References

Sternberg, R. J., Wagner, R. K., Williams, W. M., & Hor- chotherapy. Journal of Consulting Psychology, 27, 95–
vath, J. A. (1995). Testing common sense. American 101.
Psychologist, 50, 912–927. Strong, E. K., Jr. (1935). Predictive value of the Voca-
Sternberg, R. J., & Williams, W. M. (1997). Does the tional Interest Test. Journal of Educational Psychol-
Graduate Record Examination predict meaningful ogy, 26, 331–349.
success in the graduate training of psychologists? Strong, E. K., Jr. (1955). Vocational interests 18 years
American Psychologist, 52, 630–641. after college. Minneapolis, MN: University of Min-
Stewart, A. J., & Chester, N. L. (1982). Sex differ- nesota Press.
ences in human social motives: Achievement, affil- Sue, D. W., & Sue, D. (1990). Counseling the culturally
iation, and power. In A. Stewart (Ed.), Motiva- different. New York: Wiley.
tion and society (pp. 172–218). San Francisco, CA: Suinn, R. M., Ahuna, C., & Khoo, G. (1992). The
Jossey-Bass. Suinn-Lew Asian Self-Identity Acculturation Scale:
Stewart, A. L., Hays, R. D., & Ware, J. E. (1988). The Concurrent and factorial validation. Educational
MOS short-form General Health Survey: Reliability and Psychological Measurement, 52, 1041–1046.
and validity in a patient population. Medical Care, Suinn, R., Dauterman, W., & Shapiro, B. (1967). The
26, 724–735. WAIS as a predictor of educational and occupational
Stokols, D. (1992). Establishing and maintaining achievement in the adult blind. New Outlook for the
healthy environments. American Psychologist, 47, 6– Blind, 61, 41–43.
22. Suinn, R., Rikard-Figueroa, K., Lew, S., & Vigil, P.
Stone, G. C., Cohen, F., & Adler, N. E. (Eds.). (1979). (1987). The Suinn-Lew Asian Self-Identity Accul-
Health psychology: A handbook. Washington, DC: turation Scale: An initial report. Educational and
Jossey-Bass. Psychological Measurement, 47, 401–407.
Stotland, E. (1969). The psychology of hope. San Fran- Sullivan, P. M. (1982). Administration modifications
cisco, CA: Jossey-Bass. on the WISC-R Performance Scale with different
Stotland, E., & Blumenthal, A. L. (1964). The reduction categories of deaf children. American Annals of the
of anxiety as a result of the expectation of making a Deaf, 127, 780–788.
choice. Canadian Journal of Psychology, 18, 139–145. Sullivan, P. M., & Vernon, M. (1979). Psychologi-
Strahan. R., & Gerbasi, K. C. (1972). Short, homoge- cal assessment of hearing-impaired children. School
neous versions of the Marlowe-Crowne Social Desir- Psychology Digest, 8, 271–290.
ability Scale. Journal of Clinical Psychology, 28, 191– Sutter, E. G., & Battin, R. R. (1984). Using tradi-
193. tional psychological tests to obtain neuropsycholog-
Street, W. R. (1994). A chronology of noteworthy events ical information on children. International Journal
in American psychology. Washington, DC: American of Clinical Neuropsychology, 6, 115–119.
Psychological Association. Swallow, R. (1981). Fifty assessment instruments com-
Streiner, D. L. (1985). Psychological Screening Inven- monly used with blind and partially seeing individ-
tory. In D. J. Keyser & R. C. Sweetland (Eds.), Test uals. Journal of Visual Impairment and Blindness, 75,
critiques (Vol. IV, pp. 509–515). Kansas City, MO: 65–72.
Test Corporation of America. Swanson, H. L., & Watson, B. L. (1989). Educational
Streiner, D. L., & Miller, H. R. (1986). Can a good and psychological assessment of exceptional children
short form of the MMPI ever be developed? Journal (2nd ed.). Columbus, OH: Merrill.
of Clinical Psychology, 42, 109–113. Swenson, W. M. (1985). An aging psychologist
Strenio, A. J. (1981). The testing trap. New York: Raw- assessess the impact of age on MMPI profiles. Psy-
son, Wade. chiatric Annals, 15, 554–557.
Stricker, L. J., & Ross, J. (1964a). Some correlates of a Swerdlik, M. E. (1977). The question of the compa-
Jungian personality inventory. Psychological Reports, rability of the WISC and WISC-R: Review of the
14, 623–643. research and implications for school psychologists.
Stricker, L. J., & Ross, J. (1964b). An assessment of Psychology in the Schools, 14, 260–270.
some structural properties of the Jungian person- Szapocznik, J., Kurtines, W. M., & Fernandez, T.
ality typology. Journal of Abnormal and Social Psy- (1980). Bicultural involvement and adjustment in
chology, 68, 62–71. Hispanic-American youths. International Journal of
Strickland, B. R. (1989). Internal-external control Intercultural Relations, 4, 353–365.
expectancies: From contingency to creativity. Amer- Szapocznik, J., Scopetta, M. A., Aranalde, M., &
ican Psychologist, 44, 1–12. Kurtines, W. (1978). Cuban value structure: Treat-
Strickland, B. R., & Crowne, D. P. (1963). Need for ment implications. Journal of Consulting and Clini-
approval and the premature termination of psy- cal Psychology, 46, 961–970.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 611

Tamkin, A. S., & Klett, C. J. (1957). Barron’s Ego Shah & B. D. Sales (Eds.), Law and mental health:
Strength Scale: A replication of an evaluation of its Major developments and research needs (pp. 149–
construct validity. Journal of Consulting Psychology, 183). Rockville, MD: U.S. Department of Health and
21, 412. Human Services.
Tashakkori, A., Barefoot, J., & Mehryar, A. H. (1989). Teplin, L., & Swartz, J. (1989). Screening for severe
What does the Beck Depression Inventory measure mental disorder in jails: The development of the
in college students? Evidence from a non-western Referral Decision Scale. Law and Human Behavior,
culture. Journal of Clinical Psychology, 45, 595–602. 13, 1–18.
Tasto, D. L. (1977). Self-report schedules and inven- terKuile, M. M., Linssen, A. C. G., & Spinhoven, P.
tories. In A. R. Ciminero, K. S. Calhoun, & H. E. (1993). The development of the Multidimensional
Adams (Eds.), Handbook of behavioral assessment Locus of Pain Control Questionnaire (MLPC): Fac-
(pp. 153–193). New York: Wiley. tor structure, reliability, and validity. Journal of Psy-
Tasto, D. L., & Hickson, R. (1970). Standardization, chopathology and Behavioral Assessment, 15, 387–
item analysis, and scaling of the 122-item Fear Sur- 404.
vey Schedule. Behavior Therapy, 1, 473–484. Terman, L. M. (1916). The measurement of intelligence.
Tasto, D. L., Hickson, R., & Rubin, S. E. (1971). Boston, MA: Houghton-Mifflin.
Scaled profile analysis of fear survey schedule fac- Terman, L. M. (1932). Autobiography. In C. Murchi-
tors. Behavior Therapy, 2, 543–549. son (Ed.), A history of psychology in autobiography
Tate, D. G., Forchheiner, M., Maynard, F., Davidoff, (Vol. 2, pp. 297–332). Worcester, MA: Clark Univer-
G., & Dijkers, M. (1993). Comparing two measures sity Press.
of depression in spinal cord injury. Rehabilitation Terman, L. M., & Childs, H. G. (1912). A tentative revi-
Psychology, 38, 53–61. sion and extension of the Binet-Simon measuring
Tatsuoka, M. (1970). Discriminant analysis: The study scale of intelligence. Journal of Educational Psychol-
of group differences. Champaign, IL: Institute for Per- ogy, 3, 61–74; 133–143; 198–208; 277–289.
sonality and Ability Testing. Terman, L. M., & Merrill, M. A. (1937). Measuring
Taylor, H. C., & Russell, J. T. (1939). The relationship of intelligence. Boston, MA: Houghton-Mifflin.
validity coefficients to the practical effectiveness Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Per-
of tests in selection. Discussion and tables. Journal sonality measures as predictors of job performance:
of Applied Psychology, 23, 565–578. A metaanalytic review. Personnel Psychology, 44,
Taylor, J. A. (1953). A personality scale of manifest 703–742.
anxiety. Journal of Abnormal and Social Psychology, Thayer, P. W. (1977). Somethings old, somethings new.
48, 285–290. Personnel Psychology, 30, 513–524.
Taylor, J. B. (1959). Social desirability and MMPI Thistlethwaite, D. L. (1960). College press and changes
performance: The individual case. Journal of Con- in study plans of talented students. Journal of Edu-
sulting Psychology, 23, 514–517. cational Psychology, 51, 222–234.
Taylor, R. L., Kauffman, D., & Partenio, I. (1984). Thomas, A., & Chess, S. (1977). Temperament and
The Koppitz developmental scoring system for the development. New York: Brunner/Mazel.
Bender-Gestalt: Is it developmental? Psychology in Thomas, K. R., Wiesner, S. L., & Davis, R. M. (1982).
the Schools, 21, 425–428. Semantic differential ratings as indices of disability
Taylor, R. L., Slocumb, P. R., & O’Neill, J. (1979). A acceptance. Rehabilitation Psychology, 27, 245–247.
short form of the McCarthy Scales of Children’s Thomas, P. J. (1980). A longitudinal comparison of the
Abilities: Methodological and clinical applications. WISC and WISC-R with special education students.
Psychology in the Schools, 16, 347–350. Psychology in the Schools, 17, 437–441.
Taylor, S. E. (1990). Health psychology. American Psy- Thompson, B., & Daniel, L. G. (1996). Factor analytic
chologist, 45, 40–50. evidence for the construct validity of scores: A his-
Templer, D. I. (1970). The construction and validation torical overview and some guidelines. Educational
of a Death Anxiety Scale. The Journal of General Psy- and Psychological Measurement, 56, 197–208.
chology, 82, 165–177. Thompson, B., & Vacha-Haase, T. (2000). Psychomet-
Tenopyr, M. L. (1977). Content-construct confusion. rics is datametrics: The test is not reliable. Educa-
Personnel Psychology, 30, 47–54. tional and Psychological Measurement, 60, 174–195.
Tenopyr, M. L., & Oeltjen, P. D. (1982). Personnel selec- Thompson, R. J. (1977). Consequences of using the
tion and classification. Annual Review of Psychology, 1972 Stanford Binet intelligence scale norms. Psy-
33, 581–618. chology in the Schools, 14, 445–448.
Teplin, L. (1991). The criminalization hypothesis: Thompson, R. J. (1980). The diagnostic utility of
Myth, misnomer, or management strategy. In S. A. WISC-R measures with children referred to a
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

612 References

developmental evaluation center. Journal of Consult- Tillman, H. M. (1973). Intelligence scales for the blind:
ing and Clinical Psychology, 48, 440–447. A review with implications for research. Journal of
Thorndike, E. L. (1926). Measurement of intelli- School Psychology, 11, 80–87.
gence. New York: Teacher’s College, Columbia Tilton, J. W. (1937). The measurement of overlapping.
University. Journal of Educational Psychology, 28, 656–662.
Thorndike, R. L. (1940). “Constancy” of the IQ. Psy- Timbrook, R. E., & Graham, J. R. (1994). Ethnic dif-
chological Bulletin, 37, 167–186. ferences on the MMPI-2. Psychological Assessment,
Thorndike, R. L. (1949). Personnel selection. New York: 6, 212–217.
Wiley. Timbrook, R. E., Graham, J. R., Keiller, S. W., &
Thorndike, R. L. (1971). Concepts of culture- Watts, D. (1993). Comparison of the Winener-
fairness. Journal of Educational Measurement, 8, 63– Harmon subtle-obvious scales and the standard
70. validity scales in detecting valid and invalid MMPI-2
Thorndike, R. L. (1977). Causation of Binet IQ decre- profiles. Psychological Assessment, 5, 53–61.
ments. Journal of Educational Measurement, 14, Timmons, L. A., Lanyon, R. I., Almer, E. R., & Curran,
197–202. P. J. (1993). Development and validation of sen-
Thorndike, R. L., & Hagen, E. P. (1977). Measurement tence completion test indices of malingering dur-
and evaluation in Psychology and Education (4th ed.). ing examination for disability. American Journal of
New York: Wiley. Forensic Psychology, 11, 23–38.
Thorndike, R. L., Hagen, E. P., & Sattler, J. M. (1986a). Tolor, A. L., & Brannigan, G. G. (1980). Research
The Stanford-Binet Intelligence Scale: Fourth edition, and clinical applications of the Bender-Gestalt Test.
guide for administering and scoring. Chicago, IL: Springfield, II: Charles C Thomas.
Riverside. Torrance, E. P. (1966). Torrance tests of creative thinking.
Thorndike, R. L., Hagen, E. P., & Sattler, J. M. (1986b). Norms – technical manual. Princeton, NJ: Personnel
The Stanford-Binet Intelligence Scale: Technical man- Press.
ual (4th ed.). Chicago, IL: Riverside. Torrance, E. P. (1981). Empirical validation of
Thorndike, R. M. (1990). Origins of intelligence and its criterion-referenced indicators of creative ability
measurement. Journal of Psychoeducational Assess- through a longitudinal study. Creative Child and
ment, 8, 223–230. Adult Quarterly, 6, 136–140.
Thorne, A., & Gough, H. G. (1991). Portraits of type: Treffinger, D. J. (1985). Review of the Torrance Tests
An MBTI research compendium. Palo Alto, CA: Con- of Creative Thinking. In J. V. Mitchell, Jr. (Ed.),
sulting Psychologists Press. The ninth mental measurements yearbook (pp. 1632–
Thorne, F. C. (1966). The sex inventory. Journal of 1634). Lincoln, NE: University of Nebraska.
Clinical Psychology, 22, 367–374. Triandis, H. C. (1980). Values, attitudes, and interper-
Thornton, G. C., III, & Byham, W. C. (1982). Assess- sonal behavior. In M. M. Page (Ed.), Nebraska Sym-
ment centers and managerial performance. New York: posium on Motivation 1979 (pp. 195–259). Lincoln,
Academic Press. NE: University of Nebraska Press.
Throop, W. F., & MacDonald, A. P. (1971). Internal- Triandis, H. C., Kashima, Y., Hui, C. H., Lisansky, J.,
external locus of control: A bibliography. Psycholog- & Marin, G. (1982). Acculturation and bicultural-
ical Reports, 28, 175–190. ism indices among relatively acculturated Hispanic
Thurstone, L. L. (1934). The vectors of mind. Psycho- young adults. Interamerican Journal of Psychology,
logical Review, 41, 1–32. 16, 140–149.
Thurstone, L. L. (1938). Primary mental abilities. Psy- Trickett, E. J., & Moos, R. H. (1970). Generality and
chometric Monographs (No. 1). specificity of student reactions in high school class-
Thurstone, L. L. (1946). Comment. American Journal rooms. Adolescence, 5, 373–390.
of Sociology, 52, 39–50. Trickett, E. J., & Moos, R. H. (1973). Social environ-
Thurstone, L. L. (1952). Autobiography. In E. G. Bor- ment of junior high and high school classrooms.
ing et al. (Eds.), A history of psychology in autobiog- Journal of Educational Psychology, 65, 93–102.
raphy (Vol. 4, pp. 295–322). Worcester, MA: Clark Trieschmann, R. B. (1988). Spinal cord injuries: Psy-
University Press. chological, social, and vocational rehabilitation (2nd
Thurstone, L. L., & Chave, E. J. (1929). The measure- ed.). New York: Pergamon Press.
ment of attitudes. Chicago, IL: University of Chicago Trochim, W. M. K. (1985). Pattern matching, valid-
Press. ity, and conceptualization in program evaluation.
Tiedeman, D. V., & O’Hara, R. P. (1963). Career Evaluation Review, 9, 575–604.
development: Choice and adjustment. Princeton, NJ: Truch, S. (1989). WISC-R Companion. Seattle, WA:
College Entrance Examination Board. Special Child Publications.
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

References 613

Trull, T. J., & Hillerbrand, E. (1990). Psychometric Turk, D. C., Rudy, T. E., & Salovey, P. (1985). The
properties and factor structure of the Fear Ques- McGill Pain Questionnaire reconsidered: Confirm-
tionnaire Phobia Subscale items in two normative ing the factor structure and examining appropriate
samples. Journal of Psychopathology and Behavioral uses. Pain, 21, 385–397.
Assessment, 12, 285–297. Turner, R. G. (1978). Individual differences in ability to
Trull, T. J., Neitzel, M. T., & Main, A. (1988). The use image nouns. Perceptual and Motor Skills, 47, 423–
of meta-analysis to assess the clinical significance of 434.
behavior therapy for agoraphobia. Behavior Ther- Turner, R. J., & Wheaton, B. (1995). Checklist mea-
apy, 19, 527–538. surement of stressful life events. In S. Cohen, R. C.
Trybus, R. (1973). personality assessment of entering Kessler, & L. U. Gordon (Eds.), Measuring stress (pp.
hearing-impaired college students using the 16PF, 29–58). New York: Oxford University Press.
Form B. Journal of Rehabilitation of the Deaf, 6, 34– Tursky, B. (1976). The development of a pain per-
40. ception profile: A psychophysical approach. In
Trybus, R., & Karchmer, M. A. (1977). School achieve- M. Weisenberg & B. Tursky (Eds.), Pain: New
ment scores of hearing impaired children: National perspectives in therapy and research. New York:
data on achievement status and growth patterns. Plenum.
American Annals of the Deaf, 122, 62–69. Tuttle, F. B., & Becker, L. A. (1980). Characteristics and
Tryon, G. S. (1980). The measurement and treatment identification of gifted and talented students. Wash-
of test anxiety. Review of Educational Research, 50, ington, DC: National Educational Association.
343–372. Tyler, F. B., Dhawan, N., & Sinha, Y. (1989). Cul-
Tryon, W. W. (1991). Activity measurement in psychol- tural contributions to constructing locus-of-control
ogy and medicine. New York: Plenum Press. attributions. Genetic, Social, and General Psychology
Tsushima, W. T. (1994). Short form of the WPPSI and Monographs, 115, 205–220.
WPSSI-R. Journal of Clinical Psychology, 50, 877– Tyler, S., & Elliott, C. D. (1988). Cognitive profiles of
880. groups of poor readers and dyslexic children on the
Tucker, M. F., Cline, V. B., & Schmitt, J. R. (1967). Pre- British Ability Scales. British Journal of Psychology,
diction of creativity and other performance mea- 79, 493–508.
sures from biographical information among phar- Tzeng, O. C. S., Maxey, W. A., Fortier, R., & Landis, D.
maceutical scientists. Journal of Applied Psychology, (1985). Construct evaluation of the Tennessee Self
57, 131–138. Concept Scale. Educational and Psychological Mea-
Tucker, W. H. (1994). Fact and fiction in the discovery surement, 45, 63–78.
of Sir Cyril Burt’s flaws. Journal of the History of the Tzeng, O. C. S., Ware, R., & Bharadwaj, N. (1991).
Behavioral Sciences, 30, 335–347. Comparison between continuous bipolar and
Tucker, W. H. (1997). Re-considering Burt: Beyond a unipolar ratings of the Myers-Briggs Type Indica-
reasonable doubt. Journal of the History of the Behav- tor. Educational and Psychological Measurement, 51,
ioral Sciences, 33, 145–162. 681–690.
Tuckman, J., & Lorge, I. (1952). The attitudes of the Ulissi, S. M., Brice, P. J., & Gibbins, S. (1989). Use of
aged toward the older worker for institutionalized the Kaufman-Assessment Battery for Children with
and non-institutionalized adults. Journal of Geron- the hearing impaired. American Annals of the Deaf,
tology, 7, 559–564. 134, 283–287.
Tuckman, J., & Lorge, I. (1953). Attitudes toward Ullman, L. P., & Krasner, L. (1975). A psychological
old people. Journal of Social Psychology, 37, approach to abnormal behavior (2nd ed.). Engle-
249–260. wood Cliffs, NJ: Prentice Hall.
Tuckman, J., & Lorge, I. (1958). Attitude toward aging Ullman, R. K., Sleator, E. K., & Sprague, R. L.
of individuals with experiences with the aged. Jour- (1991). ADDH Comprehensive Teacher’s Rating Scale
nal of Genetic Psychology, 92, 199–204. (ACTeRS). Champaign, IL: Metritech, Inc.
Tuddenham, R. D., Davis, L., Davison, L., & Schindler, Urban, W. J. (1989). The Black scholar and intelligence
R. (1958). An experimental group version for school testing: The case of Horace Mann Bond. Journal of
children of the Progressive Matrices. Journal of Con- the History of the Behavioral Sciences, 25, 323–334.
sulting Psychology, 22, 30. U.S. Department of Labor (1970). Manual for the
Tupes, E. C., & Christal, R. E. (1961). Recurrent per- USES General Aptitude Test Battery. Washington,
sonality factors based on trait ratings. (USAF ASD DC: Manpower Administration, U.S. Department
Technical Report No. 61-97). Lackland Air Force of Labor.
Base, TX: U.S. Air Force (cited in McCrae & Costa, Valencia, R. R. (1979). Comparison of intellectual per-
1986). formance of Chicano and Anglo third-grade boys
P1: JZP
0521861810rfa3 CB1038/Domino 0 521 86181 0 March 4, 2006 14:36

614 References

on the Raven's Coloured Progressive Matrices. Psychology in the Schools, 16, 448–453.
Van den Brink, J., & Schoonman, W. (1988). First steps in computerized testing at the Dutch Railways. In F. J. Maarse, L. J. M. Mulder, W. P. B. Sjouw, & A. E. Akkerman (Eds.), Computers in psychology: Methods, instrumentation and psychodiagnostica (pp. 184–188). Amsterdam: Swets & Zeitlinger.
Vander Kolk, C. J. (1982). A comparison of intelligence test score patterns between visually impaired subgroups and the sighted. Rehabilitation Psychology, 27, 115–120.
Vandeveer, B., & Schweid, E. (1974). Infant assessment: Stability of mental functioning in young retarded children. American Journal of Mental Deficiency, 79, 1–4.
Van de Vijver, F. J. R., & Harsveld, M. (1994). The incomplete equivalence of the paper-and-pencil and computerized versions of the General Aptitude Test Battery. Journal of Applied Psychology, 79, 852–859.
Van Gorp, W., & Meyer, R. (1986). The detection of faking on the Millon Clinical Multiaxial Inventory (MCMI). Journal of Clinical Psychology, 42, 742–747.
Van Hagan, J., & Kaufman, A. S. (1975). Factor analysis of the WISC-R for a group of mentally retarded children and adolescents. Journal of Consulting and Clinical Psychology, 43, 661–667.
Vansickle, T. R., & Kapes, J. T. (1993). Comparing paper-pencil and computer-based versions of the Strong-Campbell Interest Inventory. Computers in Human Behavior, 9, 441–449.
Varble, D. L. (1971). Current status of the Thematic Apperception Test. In P. McReynolds (Ed.), Advances in psychological assessment (Vol. 2, pp. 216–235). Palo Alto, CA: Science and Behavior Books, Inc.
Veldman, D. J. (1967). Computer-based sentence completion interviews. Journal of Counseling Psychology, 14, 153–157.
Vernon, M. (1968). Fifty years of research on the intelligence of the deaf and hard-of-hearing: A survey of the literature and discussion of implications. Journal of Rehabilitation of the Deaf, 1, 1–12.
Vernon, M. (1970). Psychological evaluation and interviewing of the hearing impaired. Rehabilitation Research and Practice Review, 1, 45–52.
Vernon, M., & Brown, D. W. (1964). A guide to psychological tests and testing procedures in the evaluation of deaf and hard-of-hearing children. Journal of Speech and Hearing Disorders, 29, 414–423.
Vernon, M. C., & Andrews, J. F. (1990). The psychology of deafness. New York: Longman.
Vernon, P. E. (1960). The structure of human abilities (Rev. ed.). London: Methuen.
Vernon, P. E., & Allport, G. W. (1931). A test for personal value. Journal of Abnormal and Social Psychology, 26, 231–248.
Vescovi, G. M. (1979). The emerging private psychologist practitioner as contributor to vocational rehabilitation process for deaf clients. Journal of Rehabilitation of the Deaf, 13, 9–19.
Vieweg, B. W., & Hedlund, J. L. (1984). Psychological Screening Inventory: A comprehensive review. Journal of Clinical Psychology, 40, 1382–1393.
Vincent, K. R., & Cox, J. A. (1974). A re-evaluation of Raven's Standard Progressive Matrices. The Journal of Psychology, 88, 299–303.
Vinson, D. E., Munson, J. M., & Nakanishi, M. (1977). An investigation of the Rokeach Value Survey for consumer research applications. In W. D. Perreault (Ed.), Advances in consumer research (pp. 247–252). Provo, UT: Association for Consumer Research.
Vodanovich, S. J., & Kass, S. J. (1990). A factor analytic study of the boredom proneness scale. Journal of Personality Assessment, 55, 115–123.
Volicer, L., Hurley, A. C., Lathi, D. C., & Kowall, N. W. (1994). Measurement of severity in advanced Alzheimer's disease. Journal of Gerontology, 49, 223–226.
Volicer, L., Seltzer, B., Rheaume, Y., & Fabiszewski, K. (1987). Progression of Alzheimer-type dementia in institutionalized patients: A cross-sectional study. Journal of Applied Gerontology, 6, 83–94.
Volker, M. A., Guarnaccia, V., & Scardapane, J. R. (1999). Short forms of the Stanford-Binet Intelligence Scale: Fourth edition for screening potentially gifted preschoolers. Journal of Psychoeducational Assessment, 17, 226–235.
Von Mayrhauser, R. T. (1989). Making intelligence functional: Walter Dill Scott and applied psychological testing in World War I. Journal of the History of the Behavioral Sciences, 25, 60–72.
Vredenburg, K., Krames, L., & Flett, G. L. (1985). Reexamining the Beck Depression Inventory: The long and short of it. Psychological Reports, 57, 767–778.
Vulpe, S. G. (1982). Vulpe Assessment Battery. Toronto: National Institute on Mental Retardation.
Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Cambridge, MA: Harvard University Press.
Waddell, D. D. (1980). The Stanford-Binet: An evaluation of the technical data available since the 1972 restandardization. Journal of School Psychology, 18, 203–209.
Wagner, R. K., & Sternberg, R. J. (1986). Tacit knowledge and intelligence in the everyday world. In R. J. Sternberg & R. K. Wagner (Eds.), Practical intelligence (pp. 51–83). Cambridge, MA: Cambridge University Press.
Wahler, R. G., House, A. E., & Stanbaugh, E. E., II (1976). Ecological assessment of child problem behavior: A clinical package for home, school, and institutional settings. New York: Pergamon.
Wainer, H. (1987). The first four millennia of mental testing: From Ancient China to the computer age. Educational Testing Service Research Report, No. 87–34.
Wainer, H. (1993). Measurement problems. Journal of Educational Measurement, 30, 1–21.
Walczyk, J. J. (1993). A computer program for constructing language comprehension tests. Computers in Human Behavior, 9, 113–116.
Walk, R. D. (1956). Self ratings of fear in a fear-invoking situation. Journal of Abnormal and Social Psychology, 22, 171–178.
Walker, D. K. (1973). Socioemotional measures for preschool and kindergarten children. San Francisco, CA: Jossey-Bass.
Walker, N. W., & Myrick, C. C. (1985). Ethical considerations in the use of computers in psychological testing and assessment. Journal of School Psychology, 23, 51–57.
Wallace, W. L. (1950). The relationship of certain variables to discrepancy between expressed and inventoried vocational interest. American Psychologist, 5, 354 (abstract).
Wallas, G. (1926). The art of thought. London: Watts.
Wallbrown, F. H., & Jones, J. A. (1992). Reevaluating the factor structure of the Revised California Psychological Inventory. Educational and Psychological Measurement, 52, 379–386.
Walls, R. T., Werner, T. J., Bacon, A., & Zane, T. (1977). Behavior checklists. In J. D. Cone & R. P. Hawkins (Eds.), Behavioral assessment (pp. 77–146). New York: Brunner/Mazel.
Wallston, B. S., Wallston, K. A., Kaplan, G. D., & Maides, S. A. (1976). Development and validation of the Health Locus of Control Scale. Journal of Consulting and Clinical Psychology, 44, 580–585.
Wallston, K. A., Wallston, B. S., & DeVellis, R. (1978). Development of the Multidimensional Health Locus of Control (MHLC) Scales. Health Education Monographs, 6, 160–169.
Walsh, J. A. (1972). Review of the CPI. In O. K. Buros (Ed.), The seventh mental measurements yearbook (Vol. 1, pp. 96–97). Highland Park, NJ: Gryphon Press.
Walters, G. D. (1988). Schizophrenia. In R. L. Greene (Ed.), The MMPI: Use with specific populations (pp. 50–73). Philadelphia, PA: Grune & Stratton.
Walters, G. D., White, T. W., & Greene, R. L. (1988). Use of the MMPI to identify malingering and exaggeration of psychiatric symptomatology in male prison inmates. Journal of Consulting and Clinical Psychology, 56, 111–117.
Wampler, K. S., & Halverson, C. F. (1990). The Georgia Marriage Q-sort: An observational measure of marital functioning. American Journal of Family Therapy, 18, 169–178.
Wang, K. A. (1932). Suggested criteria for writing attitude statements. Journal of Social Psychology, 3, 367–373.
Wang, M. W., & Stanley, J. C. (1970). Differential weighting: A review of methods and empirical studies. Review of Educational Research, 40, 663–705.
Wapner, S. (1990). Introduction. Journal of the History of the Behavioral Sciences, 26, 107–113.
Ward, C. H., Beck, A. T., Mendelson, M., Mock, J. E., & Erbaugh, J. K. (1962). The psychiatric nomenclature: Reasons for diagnostic disagreement. Archives of General Psychiatry, 7, 198–205.
Wardrop, J. L. (1989). Review of the California Achievement Tests. In J. C. Conoley & J. J. Kramer (Eds.), The tenth mental measurements yearbook (pp. 128–133). Lincoln, NE: University of Nebraska Press.
Ware, J. E., & Sherbourne, C. D. (1992). The MOS 36-item short-form Health Survey (SF-36). I. Conceptual framework and item selection. Medical Care, 30, 473–483.
Ware, J. E., Snow, K. K., & Kosinski, M. (1993). SF-36 Health Survey: Manual and interpretation guide. Boston, MA: The Health Institute, New England Medical Center Hospitals.
Waring, D., Farthing, C., & Kidder-Ashley, P. (1999). Impulsive response style affects computer-administered multiple choice test performance. Journal of Instructional Psychology, 26, 121–128.
Waters, L. K. (1965). A note on the "fakability" of forced-choice scales. Personnel Psychology, 18, 187–191.
Watkins, C. E., Jr. (1986). Validity and usefulness of WAIS-R, WISC-R, and WPPSI short forms: A critical review. Professional Psychology, 17, 36–43.
Watkins, E. O. (1976). The Watkins Bender-Gestalt scoring system. Novato, CA: Academic Therapy.
Watson, B. U. (1983). Test-retest stability of the Hiskey-Nebraska Test of Learning Aptitude in a sample of hearing-impaired children and adolescents. Journal of Speech and Hearing Disorders, 48, 145–149.
Watson, B. U., & Goldgar, D. E. (1985). A note on the use of the Hiskey-Nebraska Test of Learning Aptitude with deaf children. Language, Speech, and Hearing Services in Schools, 16, 53–57.
Watson, D. (1979). Guidelines for the psychological and vocational assessment of deaf rehabilitation clients. Journal of Rehabilitation of the Deaf, 13, 27–57.
Watson, D. (1989). Strangers' ratings of the five robust personality factors: Evidence of a surprising convergence with self-report. Journal of Personality and Social Psychology, 57, 120–128.
Watson, R. I. (1966). The role and use of history in the psychology curriculum. Journal of the History of the Behavioral Sciences, 2, 64–69.
Watts, K., Baddeley, A., & Williams, M. (1982). Automated tailored testing using Raven's matrices and the Mill Hill vocabulary test: A comparison with manual administration. International Journal of Man-Machine Studies, 17, 331–344.
Webb, J. T., Miller, M. L., & Fowler, R. D., Jr. (1970). Extending professional time: A computerized MMPI interpretation service. Journal of Clinical Psychology, 26, 210–214.
Webb, S. C. (1955). Scaling of attitudes by the method of equal-appearing intervals: A review. Journal of Social Psychology, 42, 215–239.
Wechsler, D. (1939). The measurement of adult intelligence. Baltimore, MD: Williams & Wilkins.
Wechsler, D. (1941). The measurement of adult intelligence (2nd ed.). Baltimore, MD: Williams & Wilkins.
Wechsler, D. (1958). The measurement and appraisal of adult intelligence. Baltimore, MD: Williams & Wilkins.
Wechsler, D. (1967). Manual for the Wechsler Preschool and Primary Scale of Intelligence. New York: Psychological Corporation.
Wechsler, D. (1974). Manual for the Wechsler Intelligence Scale for Children – Revised. New York: Psychological Corporation.
Wechsler, D. (1975). Intelligence defined and undefined: A relativistic approach. American Psychologist, 30, 135–139.
Wechsler, D. (1981). Wechsler Adult Intelligence Scale – Revised. New York: Psychological Corporation.
Wechsler, D. (1984). WISC-RM escala de inteligencia para nivel escolar Wechsler. Mexico, DF: El Manual Moderno.
Wechsler, D. (1987). Wechsler Memory Scale – Revised manual. New York: The Psychological Corporation.
Wechsler, D. (1989). Wechsler Preschool and Primary Scale of Intelligence – Revised (WPPSI-R). San Antonio, TX: Psychological Corporation.
Wechsler, D. (1991). Wechsler Intelligence Scale for Children – Third Edition: Manual. New York: The Psychological Corporation.
Weckowicz, T. E., Muir, W., & Cropley, A. J. (1967). A factor analysis of the Beck Inventory of Depression. Journal of Consulting Psychology, 31, 23–28.
Weinberger, M., Hiner, S. L., & Tierney, W. M. (1987). In support of hassles as a measure of stress in predicting health outcomes. Journal of Behavioral Medicine, 10, 19–31.
Weiner, B. (1980). Human motivation. New York: Holt, Rinehart and Winston.
Weiner, I. (1977). Approaches to Rorschach validation. In M. A. Rickers-Ovsiankina (Ed.), Rorschach psychology. Huntington, NY: Robert E. Krieger.
Weiner, I. B. (1994). The Rorschach Inkblot Method (RIM) is not a test: Implications for theory and practice. Journal of Personality Assessment, 62, 498–504.
Weiss, D. J. (1985). Adaptive testing by computer. Journal of Consulting and Clinical Psychology, 53, 774–789.
Weiss, D. J., & Davison, M. L. (1981). Test theory and methods. Annual Review of Psychology, 32, 629–658.
Weiss, D. J., & Dawis, R. V. (1960). An objective validation of factual interview data. Journal of Applied Psychology, 44, 381–385.
Weiss, R. L., & Margolin, G. (1977). Marital conflict and accord. In A. R. Ciminero, K. S. Calhoun, & H. E. Adams (Eds.), Handbook for behavioral assessment. New York: Wiley.
Weiss, R. S. (1973). Loneliness: The experience of emotional and social isolation. Cambridge, MA: MIT Press.
Weissman, M. M., Sholomskas, D., Pottenger, M., Prusoff, B. A., & Locke, B. Z. (1977). Assessing depressive symptoms in five psychiatric populations: A validation study. American Journal of Epidemiology, 106, 203–214.
Weitz, J. (1950). Verbal and pictorial questionnaires in market research. Journal of Applied Psychology, 34, 363–366.
Weitz, J., & Nuckols, R. C. (1953). A validation study of "How Supervise?" Journal of Applied Psychology, 37, 7–8.
Welch, G., Hall, A., & Norring, C. (1990). The factor structure of the Eating Disorder Inventory in a patient setting. International Journal of Eating Disorders, 9, 79–85.
Welch, G., Hall, A., & Walkey, F. (1988). The factor structure of the Eating Disorders Inventory. Journal of Clinical Psychology, 44, 51–56.
Welch, G., Hall, A., & Walkey, F. (1990). The replicable dimensions of the Beck Depression Inventory. Journal of Clinical Psychology, 46, 817–827.
Wellman, B. L., Skeels, H. M., & Skodak, M. (1940). Review of McNemar's critical examination of Iowa studies. Psychological Bulletin, 37, 93–111.
Welsh, G. S. (1948). An extension of Hathaway's MMPI profile coding system. Journal of Consulting Psychology, 12, 343–344.
Welsh, G. S. (1956). Factor dimensions A & R. In G. S. Welsh & W. G. Dahlstrom (Eds.), Basic readings on the MMPI in psychology and medicine (pp. 264–281). Minneapolis, MN: University of Minnesota Press.
Welsh, G. S. (1966). Comparison of the D-48, Terman CMT, and Art Scale scores of gifted adolescents. Journal of Consulting Psychology, 30, 88.
Werner, E. E., Bierman, J. M., & French, F. E. (1971). The children of Kauai: A longitudinal study from the prenatal period. Honolulu, HI: University of Hawaii Press.
Werner, P. D. (1993). A Q-sort measure of beliefs about abortion in college students. Educational and Psychological Measurement, 53, 513–521.
Werner, S. H., Jones, J. W., & Steffy, B. D. (1989). The relationship between intelligence, honesty, and theft admissions. Educational and Psychological Measurement, 49, 921–927.
Wertheimer, M. (1958). Principles of perceptual organization. In D. C. Beardslee & M. Wertheimer (Eds.), Readings in perception. New York: Van Nostrand.
Wesman, A. G. (1968). Intelligent testing. American Psychologist, 23, 267–274.
Wesman, A. G. (1971). Writing the test item. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Westman, J. C. (1990). Handbook of learning disabilities: A multisystem approach. Boston, MA: Allyn & Bacon.
Wetzler, S. (1989a). Parameters of psychological assessment. In S. Wetzler & M. M. Katz (Eds.), Contemporary approaches to psychological assessment (pp. 3–15). New York: Brunner/Mazel.
Wetzler, S. (1989b). Self-report tests: The patient's vantage. In S. Wetzler & M. M. Katz (Eds.), Contemporary approaches to psychological assessment (pp. 98–117). New York: Brunner/Mazel.
Wetzler, S., Kahn, R., Strauman, T. J., & Dubro, A. (1989). Diagnosis of major depression by self-report. Journal of Personality Assessment, 53, 22–30.
Wetzler, S., & Marlowe, D. B. (1993). The diagnosis and assessment of depression, mania, and psychosis by self-report. Journal of Personality Assessment, 60, 1–31.
Wharton, Y. L. (1977). List of hypotheses advanced to explain the SAT score decline. New York: College Entrance Examination Board.
Whipple, G. M. (1910). Manual of mental and physical tests. Baltimore, MD: Warwick and York.
White, D. M., Clements, C. B., & Fowler, R. D. (1986). A comparison of computer administration with standard administration of the MMPI. Computers in Human Behavior, 1, 153–162.
White, D. R., & Jacobs, E. (1979). The prediction of first-grade reading achievement from WPPSI scores of preschool children. Psychology in the Schools, 16, 189–192.
White, K., Sheehan, P. W., & Ashton, R. (1977). Imagery assessment: A survey of self-report measures. Journal of Mental Imagery, 1, 145–170.
White, K. O. (1978). Testing the handicapped for employment purposes: Adaptations for persons with motor handicaps (PS 78–4). Washington, DC: Personnel Research and Development Center, U.S. Civil Service Commission.
Whitehorn, J. C., & Betz, B. J. (1960). Further studies of the doctor as a crucial variable in the outcome of treatment with schizophrenics. American Journal of Psychiatry, 117, 215–223.
Whitney, D. R., Malizio, A. G., & Patience, W. M. (1986). Reliability and validity of the GED tests. Educational and Psychological Measurement, 46, 689–698.
Whitworth, R. H. (1984). Bender Visual Motor Gestalt Test. In D. J. Keyser & R. C. Sweetland (Eds.), Test critiques (Vol. 1, pp. 90–98). Kansas City, MO: Test Corporation of America.
Whitworth, R. H. (1987). The Halstead-Reitan Neuropsychological battery and allied procedures. In D. J. Keyser & R. C. Sweetland (Eds.), Test critiques compendium (pp. 196–205). Kansas City, MO: Test Corporation of America.
Whitworth, R. H., & Barrientos, G. A. (1990). Comparison of Hispanic and Anglo Graduate Record Examination scores and academic performance. Journal of Psychoeducational Assessment, 8, 128–132.
Wicker, A. W. (1969). Attitudes versus actions: The relationship of verbal and overt behavioral responses to attitude objects. Journal of Social Issues, 25, 41–78.
Wider, A. (1948). The Cornell Medical Index. New York: Psychological Corporation.
Widiger, T. A., & Frances, A. (1987). Interviews and inventories for the measurement of personality disorders. Clinical Psychology Review, 7, 49–75.
Widiger, T. A., & Kelso, K. (1983). Psychodiagnosis of Axis II. Clinical Psychology Review, 3, 491–510.
Widiger, T. A., & Sanderson, C. (1987). The convergent and discriminant validity of the MCMI as a measure of the DSM-III personality disorders. Journal of Personality Assessment, 51, 228–242.
Widiger, T. A., Williams, J. B., Spitzer, R. L., & Frances, A. (1985). The MCMI as a measure of DSM-III. Journal of Personality Assessment, 49, 366–378.
Widiger, T. A., Williams, J. B., Spitzer, R. L., & Frances, A. (1986). The MCMI and DSM-III: A brief rejoinder to Millon (1985). Journal of Personality Assessment, 50, 198–204.
Wiederman, M. W., & Allgeier, E. R. (1993). The measurement of sexual-esteem: Investigation of Snell and Papini's (1989) Sexuality Scale. Journal of Research in Personality, 27, 88–102.
Wiechmann, G. H., & Wiechmann, L. A. (1973). Multiple factor analysis: An approach to attitude validation. Journal of Experimental Education, 41, 74–84.
Wiener, D. N. (1948). Subtle and obvious keys for the Minnesota Multiphasic Personality Inventory. Journal of Consulting Psychology, 12, 164–170.
Wierzbicki, M., & Daleiden, E. L. (1993). The differential responding of college students to subtle and obvious MCMI subscales. Journal of Clinical Psychology, 49, 204–208.
Wiggins, G. (1990). The case for authentic assessment. ERIC Clearinghouse on Tests, Measurement, and Evaluation. Washington, DC: American Institutes for Research.
Wiggins, J. S. (1968). Personality structure. Annual Review of Psychology, 19, 293–350.
Wiggins, J. S. (1973). Personality and prediction: Principles of personality assessment. Reading, MA: Addison-Wesley.
Wiggins, J. S. (1982). Circumplex models of interpersonal behavior in clinical psychology. In P. C. Kendall & J. N. Butcher (Eds.), Handbook of research methods in clinical psychology (pp. 183–221). New York: Wiley.
Wiggins, J. S. (1989). Review of the Personality Research Form (3rd ed.). In J. C. Conoley & J. J. Kramer (Eds.), The tenth mental measurements yearbook (pp. 633–634). Lincoln, NE: University of Nebraska Press.
Wiggins, J. S., & Pincus, A. L. (1992). Personality: Structure and assessment. In M. R. Rosenzweig & L. W. Porter (Eds.), Annual review of psychology (Vol. 43, pp. 473–504). Palo Alto, CA: Annual Reviews.
Wikoff, R. L. (1979). The WISC-R as a predictor of achievement. Psychology in the Schools, 16, 364–366.
Wilkening, G. N., Golden, C. J., MacInnes, W. D., Plaisted, J. R., & Hermann, B. P. (1981, August). The Luria-Nebraska Neuropsychological Battery – Children's revision: A preliminary report. Paper presented at the meeting of the American Psychological Association, Los Angeles, CA.
Williams, J. B. W., Gibbon, M., First, M. B., Spitzer, R. L., Davies, M., Borus, J., Howes, M. J., Kane, J., Pope, H. G., Rounsaville, B., & Wittchen, H. (1992). The Structured Clinical Interview for DSM-III-R (SCID): Multiple test-retest reliability. Archives of General Psychiatry, 42, 630–636.
Williams, R. L. (1970). Black pride, academic relevance, and individual achievement. The Counseling Psychologist, 2, 18–22.
Williams, R. L. (1971). Abuses and misuses in testing black children. The Counseling Psychologist, 2, 62–73.
Williams, R. L. (1972). Abuses and misuses in testing black children. In R. L. Jones (Ed.), Black psychology. New York: Harper & Row.
Williams, R. L. (1974). Scientific racism and IQ: The silent mugging of the black community. Psychology Today, 7, 32ff.
Williams, R. L. (1975). The BITCH-100: A culture-specific test. Journal of Afro-American Issues, 3, 103–116.
Willingham, W. W. (1989). Standard testing conditions and standard score meaning for handicapped examinees. Applied Measurement in Education, 2, 97–103.
Willingham, W. W., Ragosta, M., Bennett, R. E., Braun, H., Rock, D. A., & Powers, D. E. (1988). Testing handicapped people. Boston, MA: Allyn & Bacon.
Wilson, E. L. (1980). The use of psychological tests in diagnosing the vocational potential of visually handicapped persons who enter supportive and unskilled occupations. In B. Bolton & D. W. Cook (Eds.), Rehabilitation client assessment (pp. 65–77). Baltimore, MD: University Park Press.
Wilson, F. R., Genco, K. T., & Yager, G. G. (1985). Assessing the equivalence of paper-and-pencil vs. computerized tests: Demonstration of a promising methodology. Computers in Human Behavior, 1, 265–275.
Wilson, G. D. (1973). The psychology of conservatism. New York: Academic Press.
Wilson, G. D., & Patterson, J. R. (1968). A new measure of conservatism. British Journal of Social and Clinical Psychology, 7, 264–269.
Wilson, K. M. (1974). The contribution of measures of aptitude (SAT) and achievement (CEEB Achievement Average), respectively, in forecasting college grades in several liberal arts colleges. ETS Research Bulletin, No. 74–36.
Wilson, R. C., Christensen, P. R., Merrifield, P. R., & Guilford, J. P. (1960). Alternate uses: Manual of administration, scoring, and interpretation. Beverly Hills, CA: Sheridan Supply Co.
Wilson, R. S. (1975). Twins: Patterns of cognitive development as measured on the Wechsler Preschool and Primary Scale of Intelligence. Developmental Psychology, 11, 126–134.
Wilson, S. L. (1991). Microcomputer-based psychological assessment – an advance in helping severely physically disabled people. In P. L. Dann, S. H. Irvine, & J. M. Collis (Eds.), Advances in computer-based human assessment (pp. 171–187). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Wilson, S. L., Thompson, J. A., & Wylie, G. (1982). Automated psychological testing for the severely physically handicapped. International Journal of Man-Machine Studies, 17, 291–296.
Wing, J. K. (Ed.). (1966). Early childhood autism. Oxford: Pergamon Press.
Winter, W. D., Ferreira, A. J., & Olson, J. L. (1966). Hostility themes in the family TAT. Journal of Projective Techniques and Personality Assessment, 30, 270–274.
Winterling, D., Crook, T., Salama, M., & Gabert, J. (1986). A self-rating scale for assessing memory loss. In A. Bes, J. Cahn, S. Hoyer, J. P. Marc-Vergnes, & H. M. Wisniewski (Eds.), Senile dementias: Early detection (pp. 482–486). London: John Libbey Eurotext.
Winters, K. C., & Henley, G. A. (1989). Personal Experience Inventory (PEI) test and manual. Los Angeles, CA: Western Psychological Services.
Wirt, R. D., Lachar, D., Klinedinst, J. K., & Seat, P. D. (1984). Multidimensional description of child personality: A manual for the Personality Inventory for Children (Revised 1984 by D. Lachar). Los Angeles, CA: Western Psychological Services.
Wirt, R. D., Lachar, D., Klinedinst, J. K., & Seat, P. D. (1990). Personality Inventory for Children – 1990 edition. Los Angeles, CA: Western Psychological Services.
Wirtz, W., & Howe, W. (1977). On further examination: Report of the advisory panel on the Scholastic Aptitude Test score decline. New York: College Entrance Examination Board.
Wise, S. L., & Wise, L. A. (1987). Comparison of computer-administered and paper-administered achievement tests with elementary school children. Computers in Human Behavior, 3, 15–20.
Wisniewski, J. J., & Naglieri, J. A. (1989). Validity of the Draw A Person: A quantitative scoring system with the WISC-R. Journal of Psychoeducational Assessment, 7, 346–351.
Witt, J. C., Heffer, R. W., & Pfeiffer, J. (1990). Structured rating scales: A review of self-report and informant rating processes, procedures, and issues. In C. R. Reynolds & R. W. Kamphaus (Eds.), Handbook of psychological and educational assessment of children (pp. 364–394). New York: Guilford Press.
Wodrich, D. L., & Kush, S. A. (1990). Children's psychological testing (2nd ed.). Baltimore, MD: Paul H. Brookes.
Wolf, T. H. (1969a). The emergence of Binet's conception and measurement of intelligence: A case history of the creative process. Journal of the History of the Behavioral Sciences, 5, 113–134.
Wolf, T. H. (1969b). The emergence of Binet's conception and measurement of intelligence: A case history of the creative process. Part II. Journal of the History of the Behavioral Sciences, 5, 207–237.
Wolf, T. H. (1973). Alfred Binet. Chicago, IL: University of Chicago Press.
Wolf, T. M., Elston, R. C., & Kissling, G. E. (1989). Relationship of hassles, uplifts, and life events to psychological well-being of freshman medical students. Behavioral Medicine, 15, 37–45.
Wolk, R. L. (1972). Refined projective techniques with the aged. In D. P. Kent, R. Kastenbaum, & S. Sherwood (Eds.), Research planning and action for the elderly (pp. 218–244). New York: Behavioral Publications.
Wolk, R. L., & Wolk, R. B. (1971). Manual: Gerontological Apperception Test. New York: Human Sciences Press.
Wolk, S., & Zieziula, F. R. (1985). Reliability of the 1973 Edition of the SAT-HI over time: Implications for assessing minority students. American Annals of the Deaf, 130, 285–290.
Wollersheim, J. P. (1970). Effectiveness of group therapy based upon learning principles in the treatment of overweight women. Journal of Abnormal Psychology, 76, 462–474.
Wolman, B. B. (Ed.). (1985). Handbook of intelligence. New York: Wiley.
Wolpe, J. (1973). The practice of behavior therapy (2nd ed.). New York: Pergamon.
Wolpe, J., & Lang, P. J. (1964). A fear survey schedule for use in behavior therapy. Behavior Research and Therapy, 2, 27–30.
Wolpe, J., & Lang, P. J. (1977). Manual for the Fear Survey Schedule (Rev.). San Diego, CA: Educational and Industrial Testing Service.
Wong, Y. I. (2000). Measurement properties of the Center for Epidemiologic Studies – Depression Scale in a homeless population. Psychological Assessment, 12, 69–76.
Wood, J. M., Nezworski, M. T., & Stejskal, W. J. (1996). The comprehensive system for the Rorschach: A critical examination. Psychological Science, 7, 3–10.
Wood, J. M., Nezworski, M. T., & Stejskal, W. J. (1997). The reliability of the Comprehensive System for the Rorschach: A comment on Meyer (1997). Psychological Assessment, 9, 490–494.
Woodworth, R. S. (1920). Personal Data Sheet. Chicago, IL: Stoelting.
Woon, T., Masuda, M., Wagner, N. N., & Holmes, T. H. (1971). The Social Readjustment Rating Scale: A cross-cultural study of Malaysians and Americans. Journal of Cross-Cultural Psychology, 2, 373–386.
Worchel, F. F., Aaron, L. L., & Yates, D. F. (1990). Gender bias on the Thematic Apperception Test. Journal of Personality Assessment, 55, 593–602.
Worthen, B. R., Borg, W. R., & White, K. R. (1993). Measurement and evaluation in the schools: A practical guide. White Plains, NY: Longman.
Wright, B. D., & Stone, M. H. (1985). Review of the British Ability Scales. In J. V. Mitchell, Jr. (Ed.), The ninth mental measurements yearbook (Vol. 1, pp. 232–235). Lincoln, NE: University of Nebraska Press.
Wright, D., & DeMers, S. T. (1982). Comparison of the relationship between two measures of visual-motor coordination and academic achievement. Psychology in the Schools, 19, 473–477.
Wright, J. H., & Hicks, J. M. (1966). Construction and validation of a Thurstone scale of Liberalism-Conservatism. Journal of Applied Psychology, 50, 9–12.
Wylie, R. C. (1974). The self-concept: A review of methodological considerations and measuring instruments (Rev. ed., Vol. 1). Lincoln, NE: University of Nebraska Press.
Wytek, R., Opgenoorth, E., & Presslich, O. (1984). Development of a new shortened version of Raven's Matrices Test for application and rough assessment of present intellectual capacity within psychopathological investigation. Psychopathology, 17, 49–58.
Yachnick, M. (1986). Self-esteem in deaf adolescents. American Annals of the Deaf, 131, 305–310.
Yang, J., McCrae, R. R., Costa, P. T., Jr., Dai, X., Yao, S., Cai, T., & Gao, B. (1999). Cross-cultural personality assessment in psychiatric populations: The NEO-PI-R in the People's Republic of China. Psychological Assessment, 11, 359–368.
Yerkes, R. M., & Foster, J. C. (1923). A point scale for measuring mental ability. Baltimore, MD: Warwick and York.
Yin, P., & Fan, X. (2000). Assessing the reliability of Beck Depression Inventory scores: Reliability generalization across studies. Educational and Psychological Measurement, 60, 201–223.
Ying, Y. (1988). Depressive symptomatology among Chinese-Americans as measured by the CES-D. Journal of Clinical Psychology, 44, 739–746.
Yudin, L. W. (1966). An abbreviated form of the WISC for use with emotionally disturbed children. Journal of Consulting Psychology, 30, 272–275.
Zajonc, R. B., & Bargh, J. (1980). Birth order, family size, and decline of SAT scores. American Psychologist, 35, 662–668.
Zakahi, W. R., & Duran, R. L. (1982). All the lonely people: The relationship among loneliness, communicative competence, and communication anxiety. Communication Quarterly, 30, 203–209.
Zanna, M. P., Olson, J. M., & Fazio, R. H. (1980). Attitude-behavior consistency: An individual difference perspective. Journal of Personality and Social Psychology, 38, 432–440.
Zanna, M. P., & Rempel, J. K. (1988). Attitudes: A new look at an old concept. In D. Bar-Tal & A. W. Kruglanski (Eds.), The social psychology of knowledge (pp. 315–334). New York: Cambridge University Press.
Zapf, P. A., & Viljoen, J. L. (2003). Issues and considerations regarding the use of assessment instruments in the evaluation of competency to stand trial. Behavioral Sciences and the Law, 21, 351–367.
Zautra, A. J., & Reich, J. W. (1983). Life events and perceptions of life quality: Developments in a two-factor approach. Journal of Community Psychology, 1, 121–132.
Zea, M. C., & Tyler, F. B. (1994). Illusions of control: A factor-analytic study of locus of control in Colombian students. Genetic, Social and General Psychology Monographs, 120, 201–224.
Zebb, B. J., & Meyers, L. S. (1993). Reliability and validity of the Revised California Psychological Inventory's Vector 1 scale. Educational and Psychological Measurement, 53, 271–280.
Zehrbach, R. R. (1975). Comprehensive Identification Process. Bensenville, IL: Scholastic Testing Service.
Zeiss, A. M. (1980). Aversiveness vs. change in the assessment of life stress. Journal of Psychosomatic Research, 24, 15–19.
Zeleznik, C., Hojat, M., & Veloski, J. J. (1987). Predictive validity of the MCAT as a function of undergraduate institution. Journal of Medical Education, 62, 163–169.
Zelinski, E. M., Gilewski, M. J., & Thompson, L. W. (1980). Do laboratory tests relate to self-assessment of memory ability in the young and old? In L. W. Poon, J. L. Fozard, L. S. Cermak, D. Arenberg, & L. W. Thompson (Eds.), New directions in memory and aging (pp. 519–544). Hillsdale, NJ: Erlbaum.
Zenderland, L. (1997). The Bell Curve and the shape of history. Journal of the History of the Behavioral Sciences, 33, 135–139.
Zerbe, W. J., & Paulhus, D. L. (1987). Socially desirable responding in organizational behavior: A reconception. Academy of Management Review, 12, 250–264.
Zielinski, J. J. (1993). A comparison of the Wechsler Memory Scale – Revised and the Memory Assessment Scales: Administrative, clinical, and interpretive issues. Professional Psychology, 24, 353–359.
Zieziula, F. R. (Ed.). (1982). Assessment of hearing-impaired people. Washington, DC: Gallaudet College.
Zigler, E., Balla, D., & Hodapp, R. (1984). On the definition and classification of mental retardation. American Journal of Mental Deficiency, 89, 215–230.
Zigler, E., & Muenchow, S. (1992). Head Start: The inside story of America's most successful educational experiment. New York: Basic Books.
Zilboorg, G. (1941). A history of medical psychology. New York: Norton.
Zimiles, H. (1996). Rethinking the validity of psychological assessment. American Psychologist, 51, 980–981.
Zimmerman, I. L., & Woo-Sam, J. (1972). Research with the Wechsler Scale for Children: 1960–1970 (Special Monograph Supplement). Psychology in the Schools, 9, 232–271.
Zimmerman, I. L., Woo-Sam, J., & Glasser, A. J. (1973). The clinical interpretation of the Wechsler Adult Intelligence Scale. Orlando, FL: Grune & Stratton.
Zimmerman, M. (1983). Methodological issues in the assessment of life events: A review of issues and research. Clinical Psychology Review, 3, 339–370.
Zimmerman, M. (1983). Weighted versus unweighted life event scores: Is there a difference? Journal of Human Stress, 9, 30–35.
Zimmerman, M. (1986). The stability of the revised Beck Depression Inventory in college students: Relationship with life events. Cognitive Therapy and Research, 10, 37–43.
Zingale, S. A., & Smith, M. D. (1978). WISC-R patterns for learning disabled children at three SES levels. Psychology in the Schools, 15, 199–204.
Ziskin, J., & Faust, D. (1988). Coping with psychiatric and psychological testimony (Vols. 1–3, 4th ed.). Marina Del Rey, CA: Law and Psychology Press.
Zuckerman, M. (1985). Review of Sixteen Personality Factor Questionnaire. In J. V. Mitchell, Jr. (Ed.), The ninth mental measurements yearbook (Vol. 2, pp. 1392–1394). Lincoln, NE: University of Nebraska Press.
Zuckerman, M., & Lubin, B. (1965). Manual for the Multiple Affect Adjective Check List. San Diego, CA: Educational & Industrial Testing Service.
Zung, W. W. K. (1965). A Self-rating Depression Scale. Archives of General Psychiatry, 12, 63–70.
Zung, W. W. K. (1969). A cross-cultural survey of symptoms in depression. American Journal of Psychiatry, 126, 116–121.
Zwiebel, A., & Mertens, D. M. (1985). A comparison of intellectual structure in deaf and hearing children. American Annals of the Deaf, 130, 27–31.
Zytowski, D. G. (1981). Counseling with the Kuder Occupational Interest Survey. Chicago, IL: Science Research Associates.
Zytowski, D. G. (1985). Kuder DP manual supplement. Chicago, IL: Science Research Associates.
Zytowski, D. G., & Kuder, F. (1986). Advances in the Kuder Occupational Interest Survey. In W. B. Walsh & S. H. Osipow (Eds.), Advances in vocational psychology (Vol. 1, pp. 31–53). Hillsdale, NJ: Erlbaum.
Zytowski, D. G., & Laing, J. (1978). Validity of other-gender-normed scales on the Kuder Occupational Interest Survey. Journal of Counseling Psychology, 3, 205–209.
Zytowski, D. G., & Warman, R. E. (1982). The changing use of tests in counseling. Measurement and Evaluation in Guidance, 15, 147–152.

Test Index

Tests marked by an asterisk are mentioned briefly or in passing.
AAMD Adaptive Behavior Scale, 234
Acculturation Rating Scale for Mexican Americans (ARSMA), 286
Acculturation Scale (Marin), 285
Acculturation Scale (Olmedo), 286
Acquiescence scale, 449
ACT Interest Inventory, 158
Activities of Daily Living, 415
Adaptive Behavior Inventory for Children, 234
Adjective Check List (ACL), 209, 513
Adult Self-expression Scale (ASES), 497
Aging Anxiety Scale, 261
Alternate Uses Test (AUT), 209
Alzheimer's Disease Assessment Scale, 266
American College Testing Program (ACT), 335
Armed Forces Qualification Test (AFQT), 376
Armed Services Vocational Aptitude Battery (ASVAB), 376
Army Alpha, 525–526
Army Beta, 525–526
Attention Deficit Disorder-Hyperactivity Comprehensive Teacher's Rating Scale, 237
Attitudes towards elderly, 260

Bayley Scales of Infant Development, 228
Beck Depression Inventory (BDI), 189, 451
Bedford Alzheimer Nursing Scale, 267
Behavior Problem Checklist, 252
Bender Visual Motor Gestalt Test, 404
Benton Visual Retention Test, 267
Binet-Simon Scale
  (1905), 93, 100
  (1908), 101
  (1911), 101
  (1916), 101
  (1937), 101
  (1960), 101
  (1972), 101
  (1986), 102
Black Intelligence Test of Cultural Homogeneity (BITCH-100), 292–293
Blacky Pictures Test, 398
Boehm Test of Basic Concepts, 241
Bogardus Social Distance Scale, 133
Boredom Proneness Scale (BP), 87
Bracken Basic Concept Scale, 241
Brazelton Behavioral Assessment Scale, 230–231
Brief College Students Hassles Scale (BCSHS), 218
Brigance Inventory of Early Development, 327
British Ability Scales (BAS), 116
Bruiniks-Oseretsky Test of Motor Proficiency, 238

Cain-Levine Social Competency Scale, 230–231
California Achievement Tests, 328
California F scale, 421
California Psychological Inventory (CPI), 80
California Q set, 513
Candidate Profile Record, 365–366
Career Assessment Inventory, 159
Career Decision Scale, 159
Carrow Elicited Language Inventory, 241
Cattell 16 Personality Factors (16PF), 72
Cattell Culture Fair Intelligence Test, 287
Center for Epidemiologic Studies-Depression (CES-D), 193
Child Behavior Checklist (CBCL), 246, 252
Children's Adaptive Behavior Scale, 234
Children's Apperception Test, 398
Children's Assertive Behavior Scale (CABS), 498
Children's Inventory of Anger (CIA), 492
Chinese Tangrams, 211
Chinese Value Survey (CVS), 365
Clark-Madison Test of Oral Language, 241
Classroom Environment Scale (CES), 504
Cognitive Capacity Screening Examination, 267
College Characteristics Index, 505
Colorado Childhood Temperament Inventory, 230–231
Columbia Mental Maturity Scale, 238
Competency Screening Test, 420
Competitiveness index, 216
Comprehensive Identification Process, 224
Concept Assessment Kit, 95
Conners Rating Scales, 253
Conservatism Scale (C scale), 138

Daily Child Behavior Checklist (DCBC), 493
Death anxiety scale (DAS), 219
Death Images Scale, 265
Denman Neuropsychology Memory Scale, 267
Denver Developmental Screening Test, 224, 252
Detroit Tests of Learning Aptitude, 241
Developmental Indicators for the Assessment of Learning, 224
Developmental Test of Visual-Motor Integration, 250
D-48 test, 290
  tactual form, 310
Differential Ability Scales (DAS), 116, 117
Domino Creativity Scale on the ACL (ACL Cr.), 210
Dot Estimation task, 376
Draw-A-Man Test, 403
Draw-A-Person (DAP), 403
Dutch cognitive battery, 310
Dyadic Adjustment Scale, 508

Eating Disorder Inventory, 408
Educational Testing Service (ETS), 301
Edwards Personal Preference Schedule (EPPS), 76
Environmental Personality Inventory, 512
Environmental Preference Questionnaire (EPQ), 506
Environmental Check List (ECL), 512

F (fascist) scale (California F scale), 421
Family Adaptability and Cohesion Evaluation Scales, 507
Family Apperception Test, 509
Family Systems Test (FAST), 509
Fear Questionnaire (FQ), 501
Fear Survey Schedule, 499
Fear thermometer, 499
Frostig Developmental Test of Visual Perception, 311
Fullard Toddler Temperament Scale, 230–231

Geist Picture Interest Inventory, 158–159, 302
General Aptitude Test Battery (GATB), 380
General Health Survey, 415
Geriatric Scale of Recent Life Events, 265
Gerontological Apperception Test, 270, 398
Gesell Developmental Schedules, 228
Graduate Record Examination (GRE), 342
Grassi Basic Cognition Evaluation, 245, 246

Halstead-Reitan Neuropsychological Battery, 390, 455
Halstead-Reitan Neuropsychological Battery for Children, 247
Hardiness test, 410
Hassles and Uplifts, 414
Hassles scale, 218
Hayes-Binet, 525
Healy-Fernald tests, 525
Hiskey-Nebraska Tests of Learning Aptitude, 243
Holland Self-Directed Search, 158
Holtzman Inkblot Technique (HIT), 397
Home Observation for Measurement of the Environment Inventory (HOME), 502
House-Tree-Person, 320
How Supervise?, 382

Illinois Tests of Psycholinguistic Abilities, 241
Impact of Life Event Scale, 411
In-basket technique, 372
Individual Differences Questionnaire (IDQ), 46, 214
Individualized Classroom Environment Questionnaire (ICEQ), 504
Infant Temperament Questionnaire, 230–231
Intimacy-permissiveness scale, 205
Inventory of Psychosocial Balance (IPB), 84
Inventory of Test Anxiety, 457
Inwald Personality Inventory (IPI), 378
Iowa Tests of Basic Skills, 330

Jenkins Activity Survey, 411
Job Components Inventory (JCI), 379

Kaufman Assessment Battery for Children (K-ABC), 118
Kuder General Interest Survey (KGIS), 157
Kuder Occupational Interest Survey (KOIS), 157
Kuder Vocational Preference Record (KVPR), 157
Kuhlmann-Binet, 525

Learning Environment Inventory, 505
Learning Potential Assessment Device (LPAD), 95
Legal Attitudes Questionnaire (LAQ), 422
Leisure Activities Blank, 512
Leiter International Performance Scale, 246
Liberalism-conservatism scale, 131
Life Experience Survey, 411
Life Satisfaction, 262
Life Situation Questionnaire (LSQ), 322
Locke-Wallace Marital Adjustment Test, 508
Locus of Control Scale (I-E), 202, 203
London House Personnel Selection Inventory, 385
Luria-Nebraska Children's Neuropsychological Test Battery, 247

MacArthur Competence Assessment Tool (MacCAT), 420
Marital Satisfaction Inventory, 510
Marital Satisfaction Questionnaire for Older Persons (MSQFOP), 263
Marlowe-Crowne Social Desirability Scale, 446
Maslach Burnout Inventory, 434
Mattis Dementia Rating Scale, 266
McAndrew Alcoholism Scale of the MMPI, 408
McCarney Attention Deficit Disorders Evaluation Scale, 237–238
McCarthy Scales of Children's Abilities, 254
McClenaghan and Gallahue checklist, 238
McGill Pain Questionnaire (PQ), 416
McMaster Family Assessment Device (FAD), 508
Medical College Admission Test (MCAT), 348
Memory Assessment Clinics Self-Rating Scale (MAC-S), 269
Mental Status Exam (MSE), 162, 163, 266
Metropolitan Achievement Tests, 330
Michigan Alcoholism Screening Test (MAST), 408
Miller Analogies Test, 347
Miller Hope Scale, 216
Mill Hill Vocabulary Scale, 288
Millon Adolescent Personality Inventory, 184
Millon Behavioral Health Inventory, 184
Millon Clinical Multiaxial Inventory (MCMI), 179, 451, 455
Mini-mental State Examination, 266
Minnesota Multiphasic Personality Inventory (MMPI), 170
Minnesota Multiphasic Personality Inventory-Revised (MMPI-2), 170
Minnesota Preschool Inventory, 224
Minnesota Rate of Manipulation Test (MRMT), 308
Morale scale, 260, 264
Myers-Briggs Type Indicator (MBTI), 74, 211

National Assessment of Educational Progress (NAEP), 333
Navy Basic Test Battery, 376
NEO Personality Inventory-Revised (NEO-PI-R), 89
Neurobehavioral Cognitive Status Examination, 266
Northwestern Syntax Screening Test, 241

O'Connor Tweezer Dexterity Test, 357
Ostomy Adjustment Scale, 322
Otis-Lennon School Ability Test (OLSAT), 123

Parsons Visual Acuity Test, 245
Peabody Picture Vocabulary Test-Revised (PPVT-R), 238, 239, 317
Penile Tumescence Measurement (PTM), 494
Perkins-Binet Test of Intelligence for the Blind, 245, 309
Personal Rigidity Scale, 23
Personality Inventory for Children (PIC), 231
Personality Research Form (PRF), 78
Philadelphia Geriatric Center Morale Scale, 264
Pictorial Inventory of Careers, 302
Pictorial Test of Intelligence, 238
Piers-Harris Self-Concept Scale, 226
Pintner-Patterson Performance Scale, 525
Pleasant Event Schedule (PES), 495
Preschool Screening System, 224
Psychological Screening Inventory (PSI), 166
Purdue Hand Precision Test, 357

Q set of self-concept, 513
Q set, stages of Kubler-Ross, 515

Randt Memory Test, 267
Rathus Assertiveness Schedule, 497
Raven's Progressive Matrices (PM), 288
Reading-Free Vocational Interest Inventory-Revised, 302
Receptive and Expressive Emergent Language Scale, 241
Referral Decision Scale (RDS), 420
Reid Report, 385
Reitan-Indiana Test Battery, 247
Reynell Developmental Language Scales, 241
Roberts Apperception Test for Children, 398
Rokeach Value Survey (RVS), 143
Rorschach Inkblot Technique, 394
Rothbart Infant Behavior Questionnaire, 230–231
Rotter Incomplete Sentences Blank, 402
Rotter's Interpersonal Trust Scale, 434

Scales for Rating the Behavioral Characteristics of Superior Students, 245, 246
Schedule for Affective Disorders and Schizophrenia (SADS), 270
Schizotypal personality questionnaire (SPQ), 186
Scholastic Aptitude Test (SAT), 334
School and College Ability Tests III (SCAT III), 122
Self-Concept Scale for the Hearing Impaired (SSHI), 318
Self-Consciousness Inventory (SCI), 86
Self-Esteem Inventory, 226
Self-Esteem Questionnaire (SEQ), 200
Self-perception Inventory, 226
Self-Rating Depression Scale (SDS), 194
Self-report measure of social competence, 331
Semantic Differential (SemD), 134
Sexuality Scale (SS), 204
Short-Form General Health Survey (SF-36), 415
Short Portable Mental Status Questionnaire, 266
Sickness Impact Profile, 415
Slosson Intelligence Test (SIT), 124
Snellen chart, 245, 312
Snyder Hope Scale (SHS), 217
Social Desirability Scale (Edwards), 444
Social Readjustment Rating Scale, 412
Speed of Thinking Test (STT), 125
Spielberger Test Anxiety Inventory, 456
Standards for Educational and Psychological Testing, 10, 306
Stanford Achievement Tests, 330
Stanford Achievement Tests – Special Edition, 317
Stanford-Binet, 101
Stanford-Binet, special forms, 105
Stanton Survey, 385
State-Trait Anxiety Inventory (STAI), 187
Sternberg Triarchic Abilities Test, 96
Strong Campbell Interest Inventory (SCII), 148, 149
Strong Interest Inventory (SII), 149
Strong Vocational Interest Blank (SVIB), 148–149
Structure of Intellect Learning Abilities Test (SOI-LA), 120
Structured Clinical Interview for DSM-III (SCID), 163, 164
Study of Values (SoV), 142
Suinn-Lew Asian Self-Identity Acculturation Scale (SL-ASIA), 286
Supervisory Practices Test (SPT), 379
Symptom Checklist 90R (SCL-90R), 163, 164
System of Multicultural Pluralistic Assessment (SOMPA), 291

Taylor Manifest Anxiety Scale, 456
Teacher Rating Scales of childhood behavior problems, 330
Tell-me-a-story (TEMAS), 293
Tennessee Self Concept Scale (TSCS), 197
Test Anxiety Inventory (TAI), 456
Test Anxiety Questionnaire, 456
Test Anxiety Scale (TAS), 456
Test Anxiety Scale for Children (TASC), 457
Tests of General Educational Development (GED), 332
Thematic Apperception Test, 398
Time urgency, 374
Token Test, 241
Torrance Test of Creative Thinking (TTCT), 206

UCLA Loneliness Scale (ULS), 219
Unpleasant Events Schedule, 496

Value Survey (Rokeach) (RVS), 143
Verbalizer-Visualizer Questionnaire (VVQ), 215
Vineland Adaptive Behavior Scale, 234, 320
Vineland Social Maturity Scale, 234
Vividness of Visual Imagery Questionnaire (VVIQ), 214
Vocational Interest Inventory, 158
Vocational Interest Survey, 158
Vulpe Assessment Battery, 238

Washington University Sentence Completion Test, 402
Water Rating Scale, 505
Watson-Glaser Critical Thinking Appraisal, 357
Wechsler Adult Intelligence Scale (WAIS), 105
Wechsler Adult Intelligence Scale-Revised (WAIS-R), 105
Wechsler-Bellevue Intelligence Scale, 105
Wechsler Intelligence Scale for Children (WISC), 110
Wechsler Intelligence Scale for Children-Revised (WISC-R), 110
Wechsler Intelligence Scale for Children III (WISC-III), 110
Wechsler Memory Scale (WMS), 268
Wechsler Memory Scale-Revised (WMS-R), 268
Wechsler Preschool and Primary Scale of Intelligence (WPPSI), 114
Wechsler Preschool and Primary Scale of Intelligence-Revised (WPPSI-R), 114
Wide Range Interest Opinion Test, 158–159, 302
Wisconsin Personality Disorders Inventory (WISPI), 185
Wonderlic Personnel Test (WPT), 383
Woodworth Personal Data Sheet, 526
Work Orientation scale of the CPI (WO), 383
Worry-Emotionality Questionnaire, 456

Yerkes Point Scale, 525

Index of Acronyms
ACL  Adjective Check List
ACL Cr  Domino Creativity Scale on the ACL
AFQT  Armed Forces Qualification Test
ARSMA  Acculturation Rating Scale for Mexican Americans
ASES  Adult Self-expression Scale
ASVAB  Armed Services Vocational Aptitude Battery
AUT  Alternate Uses Test

BAI  Beck Anxiety Inventory
BAS  British Ability Scales
BCSHS  Brief College Student Hassles Scale
BDI  Beck Depression Inventory
BITCH-100  Black Intelligence Test of Cultural Homogeneity
BP  Boredom Proneness Scale

C  Conservatism Scale
CABS  Children's Assertive Behavior Scale
CBCL  Child Behavior Checklist
CES  Classroom Environment Scale
CES-D  Center for Epidemiologic Studies-Depression
CIA  Children's Inventory of Anger
CPI  California Psychological Inventory
CVS  Chinese Value Survey

DAP  Draw-A-Person
DAS  Death Anxiety Scale
DAS  Differential Ability Scales
DAT  Dental Admission Testing Program
DCBC  Daily Child Behavior Checklist

ECL  Environmental Check List
EPPS  Edwards Personal Preference Schedule
EPQ  Environmental Preference Questionnaire

F  Fascist scale (California F scale)
FAD  McMaster Family Assessment Device
FAST  Family Systems Test
FQ  Fear Questionnaire

GATB  General Aptitude Test Battery
GED  Tests of General Educational Development
GRE  Graduate Record Examination

HIT  Holtzman Inkblot Technique
HOME  Home Observation for Measurement of the Environment Inventory

ICEQ  Individualized Classroom Environment Questionnaire
IDQ  Individual Differences Questionnaire
I-E  Locus of Control scale (Internal-external)
IPB  Inventory of Psychosocial Balance
IPI  Inwald Personality Inventory

JCI  Job Components Inventory

K-ABC  Kaufman Assessment Battery for Children
KGIS  Kuder General Interest Survey
KOIS  Kuder Occupational Interest Survey
KVPR  Kuder Vocational Preference Record

LAQ  Legal Attitudes Questionnaire
LPAD  Learning Potential Assessment Device
LSQ  Life Situation Questionnaire

MAC-S  Memory Assessment Clinics Self-Rating Scale
MAST  Michigan Alcoholism Screening Test
MBTI  Myers-Briggs Type Indicator
MCAT  Medical College Admission Test
MCMI  Millon Clinical Multiaxial Inventory
MMPI  Minnesota Multiphasic Personality Inventory
MMPI-2  Minnesota Multiphasic Personality Inventory-Revised
MRMT  Minnesota Rate of Manipulation Test
MSE  Mental Status Exam
MSQFOP  Marital Satisfaction Questionnaire for Older Persons

NAEP  National Assessment of Educational Progress
NEO-PI-R  NEO Personality Inventory-Revised

OLSAT  Otis-Lennon School Ability Test

PES  Pleasant Event Schedule
P-F  Rosenzweig Picture-Frustration Study
PIC  Personality Inventory for Children
PM  Raven's Progressive Matrices
PPVT-R  Peabody Picture Vocabulary Test – Revised
PQ  McGill Pain Questionnaire
PRF  Personality Research Form
PSI  Psychological Screening Inventory
PTM  Penile Tumescence Measurement

RDS  Referral Decision Scale
RVS  Rokeach Value Survey

SADS  Schedule for Affective Disorders and Schizophrenia
SAT  Scholastic Aptitude Test
SCAT III  School and College Ability Tests, III
SCI  Self-Consciousness Inventory
SCID  Structured Clinical Interview for DSM-III
SCII  Strong Campbell Interest Inventory
SCL-90R  Symptom Checklist 90R
SDS  Self-Rating Depression Scale
SemD  Semantic Differential
SEQ  Self-Esteem Questionnaire
SF-36  Short-Form General Health Survey
SHS  Snyder Hope Scale
SII  Strong Interest Inventory
SIT  Slosson Intelligence Test
SL-ASIA  Suinn-Lew Asian Self-Identity Acculturation Scale
SOI-LA  Structure of Intellect Learning Abilities Test
SOMPA  System of Multicultural Pluralistic Assessment
SoV  Study of Values
SPQ  Schizotypal Personality Questionnaire
SPT  Supervisory Practices Test
SS  Sexuality Scale
SSHI  Self-Concept Scale for the Hearing Impaired
STAI  State-Trait Anxiety Inventory
STT  Speed of Thinking Test
SVIB  Strong Vocational Interest Blank

TAI  Test Anxiety Inventory
TASC  Test Anxiety Scale for Children
TEMAS  Tell-me-a-story
TSCS  Tennessee Self Concept Scale
TTCT  Torrance Test of Creative Thinking

ULS  UCLA Loneliness Scale

VVIQ  Vividness of Visual Imagery Questionnaire
VVQ  Verbalizer-Visualizer Questionnaire

WAIS  Wechsler Adult Intelligence Scale
WAIS-R  Wechsler Adult Intelligence Scale – Revised
WISC  Wechsler Intelligence Scale for Children
WISC-III  Wechsler Intelligence Scale for Children III
WISC-R  Wechsler Intelligence Scale for Children – Revised
WISPI  Wisconsin Personality Disorders Inventory
WMS  Wechsler Memory Scale
WMS-R  Wechsler Memory Scale – Revised
WO  Work Orientation Scale of the CPI
WPPSI  Wechsler Preschool and Primary Scale of Intelligence
WPPSI-R  Wechsler Preschool and Primary Scale of Intelligence – Revised
WPT  Wonderlic Personnel Test
16PF  Cattell 16 Personality Factors

Subject Index
AAMD Adaptive Behavior Scale, 234
Abbreviated scales, 109
Acculturation, 284
  Rating Scale for Mexican Americans (ARSMA), 286
  Scale (Marin), 285
  Scale (Olmedo), 286
  scales of, 284
Achievement tests, 303
  California Achievement Tests, 328
  college entrance, 339
Acquiescence, 427, 430, 431, 448
  scale, 449
ACT Interest Inventory, 158
Activities of Daily Living, 415
Actuarial method, 467
  clinical method and, 467
Adaptive behavior, 234
  Inventory for Children (ABIC), 234
Adaptive testing, 473
  Stanford-Binet, 103
Adjective checklists, 511
Adjective Check List (ACL), 209, 513
Admissions testing, 302
Adult Self-Expression Scale (ASES), 497
Adverse impact, 359, 387
Age norms, 35
Aging Anxiety Scale, 261
Alcoholism, 407, 408
Alpha (Cronbach), 47
Alternate form reliability, 44
Alternate Uses Test (AUT), 209
Alternation ranking, 362
Alzheimer's, 266
  Disease Assessment Scale, 266
American College Testing Program (ACT), 335
American Council on Education, 332
American Sign Language, 313
Americans with Disabilities Act, 424
Analogies, 20
Anger, 492
  Children's Inventory, 492
Angoff method, 354
Anxiety
  about aging, 261
  state-trait, 187
  test, 456
APA Committee on test bias, 276
APA ethics code, 9
APA Task Force on integrity tests, 386
Aptitude by treatment interaction, 120
Aptitudes, 303
Armed Services Vocational Aptitude Battery (ASVAB), 376
Army Alpha, 525–526
Army Beta, 525–526
Assertiveness, 496
Assessment
  clinical, 2
  direct, 22
  performance, 22
  vs. testing, 2
Assessment centers, 371
Attention Deficit Hyperactivity Disorder, 237
  teacher's rating scale, 237
Attitude scale construction
  Check Lists, 137, 138
  Equal appearing intervals (Thurstone), 129
  Scalogram analysis (Guttman), 133, 205, 260
  Semantic Differential, 134
  Social distance (Bogardus), 133
  steps in development, 140
  Summated ratings (Likert), 131
  Writing items, 140, 141
Attitudes, 127
  defined, 128
  measurement, 129
  theoretical components
  towards elderly, 260
  towards hearing impaired, 315
  See also specific topics, tests
Attraction-selection-attrition model, 363
Authentic-measurement, 22
Authoritarian personality, 138, 421, 530
Back translation, 220, 284
Banding, 373
Bandwidth-fidelity, 30, 57, 487
Barnum effect, 470
Basal level, 103
Base rate, 63, 129, 195
Base rate scores, 182
Battery, 2, 162, 163
Bayley Scales of Infant Development, 228
Beck Depression Inventory (BDI), 189, 451
Bedford Alzheimer Nursing Scale, 267
Behavior Problem Checklist, 252
Behavior rating scales, 251, 320
Behavioral assessment, 69, 478, 483, 484
Bell Curve, The (Herrnstein and Murray), 531
Bender Visual Motor Gestalt Test, 404
Benton Visual Retention Test, 267
Berkeley Growth Study, 527
Bias, 272
  intercept bias, 277
  on SAT, 337
  slope bias, 277
Binet, Alfred, 519
Binet-Simon Scale
  (1905), 93, 100
  (1908), 101
  (1911), 101
  (1916), 101
  (1937), 101
  (1960), 101
  (1972), 101
  (1986), 102
Biodata (biographical data), 365, 453
Black Intelligence Test of Cultural Homogeneity (BITCH-100), 292–293
Blacky Pictures Test, 398
Boehm Test of Basic Concepts, 241
Bogardus method, 133
Bogardus Social Distance Scale, 133
Boredom Proneness Scale (BP), 87
Bracken Basic Concept Scale, 241
Branching, 474
Brazelton Behavioral Assessment Scale, 230–231
Brief College Students Hassles Scale (BCSHS), 218
Brigance Inventory of Early Development, 327
British Ability Scales (BAS), 116
Bruininks-Oseretsky Test of Motor Proficiency, 238
Cain-Levine Social Competency Scale, 230–231
California Achievement Tests, 328
California F scale, 421
California Psychological Inventory (CPI), 39, 80
California Q set, 513
Candidate Profile Record, 365–366
Career Assessment Inventory, 159
Career Decision Scale, 159
Carrow Elicited Language Inventory, 241
Categories of tests, 5
Cattell 16 Personality Factors (16PF), 72
Cattell Culture Fair Intelligence Test, 287
CBTI Guidelines, 476
Ceiling level, 103
Center for Epidemiologic Studies-Depression (CES-D), 193
Cerebral palsy, 238
Certification, 7, 352
Check lists, 487, 490
  attitudes, 137, 138
  reliability, 491
  validity, 491
Child Behavior Checklist (CBCL), 246, 252
Children's Adaptive Behavior Scale, 234
Children's Apperception Test, 398
Children's Assertive Behavior Scale (CABS), 498
Children's Inventory of Anger (CIA), 492
Chinese Tangrams, 211
Chinese Value Survey (CVS), 365
Civil Rights Act, 306, 417, 422, 425
Clark-Madison Test of Oral Language, 241
Classification, 2, 7
Classroom Environment Scale (CES), 504
Clinical assessment, 162
Clinical intuition, 38
Clinical vs. statistical approaches, 38, 406
Closed response option, 141
Coaching, 340, 351
Coefficient alpha, 47
Coefficient kappa, 48
Coefficient of reproducibility, 134
Cognition, 92
  abilities, 308
  hearing impaired, 315
Cognitive Capacity Screening Examination, 267
Cognitive styles, 428
Cohen's Kappa coefficient, 48, 490
Cohesion and control, 508
College Characteristics Index, 505
College Entrance Examination Board, 334
Colorado Childhood Temperament Inventory, 230–231
Columbia Mental Maturity Scale, 238
Combining test scores, 38
Competence to stand trial, 419
Competency Screening Test, 420
Competitiveness, 215
  index, 216
Completion items, 20
Comprehensive Identification Process, 224
Computer, 460
  anxiety, 466
  configural scoring, 462
  disabled, 479
  Halstead-Reitan and, 473
  test administration, 462
  test scoring, 461
  See also specific tests
Computer based test interpretations (CBTI), 467
Computer testing, 461
  GRE, 341
Concept Assessment Kit, 95
Concurrent validity, 54
Confidentiality, 10
Configural interpretation, 173, 175
Configural scoring, 462
Conners Rating Scales, 253
Conservatism Scale (C scale), 138
Consistency score, 77
Construct validity, 54
  methods, 55
Consumer satisfaction studies, 469
Content analysis, 17
  MMPI, 173, 175
Content validity, 53, 71
Contrasted groups, 55, 56
Convergent and discriminant validity, 56
Convergent thinking, 203
Coping, 265
Coronary-prone behavior, 411
Correction for attenuation, 49, 57
Correlation coefficient
  Q sets, 514
  rank order, 146
Countdown method, 474
Couples, assessment, 509
Creativity, 205
  biodata, 370
  hearing impaired, 320
Creativity index, 211
Criterion contamination, 54
Criterion keying, 23, 70, 150, 152
Criterion reference, 7, 37
Criterion validity, 54, 358
Cronbach's alpha, 47
Cross-cultural testing, 204, 272, 413
  racial distance quotient, 133
  racial norming, 281, 305
  ratings, 363
  response styles, 435
  See also Minorities; specific groups
Cross-validation, 24, 57
Crystallized abilities, 102
  intelligence, 287
Culture fair tests, 282
  hearing impaired, 317
Cutoff score, 36, 39, 353
  multiple, 39
Daily Child Behavior Checklist (DCBC), 493
Daubert standard, 422
Death and dying, 265
Death anxiety, 219
  scale (DAS), 219
Death Images Scale, 265
Decentering, 284
Decision theory, 59, 181
  bias, 277
  Luria-Nebraska, 249
  tests, 4
  validity, 59
D–48 test, 290
  tactual form, 310
D index, 136
Delta scores, 30
Denman Neuropsychology Memory Scale, 267
Dental Admission Testing Program (DAT), 352
Denver Developmental Screening Test, 224, 252
Depression, 269, 495
Derived scores, 17, 25
  percentiles, 25
Detroit Tests of Learning Aptitude, 241
Developmental Indicators for the Assessment of Learning, 224
Developmental Test of Visual-Motor Integration, 250
Deviation IQ
Diagnostic and Statistical Manual (DSM), 161
Differential Ability Scales (DAS), 116, 117
Differential validity, 57
Difficulty of test items, 28
Direct assessment, 22
Direct magnitude estimation, 412
Disabilities
  categories, 297
  physical-motor disabilities, 321
Discriminant analysis, 40
  examples, 169, 435, 436
Discriminant validity, 56
Distance-cluster analysis, 136
Distractors, 19
Divergent thinking, 206
Domino Creativity Scale on the ACL (ACL Cr), 210
Dot Estimation task, 376
Downey Will Temperament Test, 527
Drawing techniques, 250, 402
Draw-A-Man Test, 403
Draw-A-Person (DAP), 403
  hearing impaired, 315
Dutch cognitive battery, 310
Dyadic Adjustment Scale, 508
Eating disorder, 408
  inventory, 408
Ebel procedure, 354
Education for All Handicapped Children Act, 223
Educational Testing Service (ETS), 12, 28, 301
Edumetric, 37–38
Edwards Personal Preference Schedule (EPPS), 76, 450
Effective therapists, 406
Efficiency. See Predictive value
Elaboration, 207
Elderly, 257
Empiricism, 33, 311
Employee Polygraph Protection Act (1988), 387
Employment testing, 356
Environment
  classroom, 503
  college, 505
  home, 502
  physical, 505
Environmental Check List (ECL), 512
Environmental Personality Inventory, 512
Environmental Preference Questionnaire (EPQ), 506
Epistemology, 94
Equal appearing intervals, 129
Equal Employment Opportunity Act, 422
Equivalence, 466
Erikson, E., 84
Error variance, 251, 326
Estimated learning potential
Ethics
  issues, 476
  standards, 9
Etic and emic, 284
Examiner error, 109
Exner Comprehensive System, 397
Expectancy table, 35, 59, 373
Experimenter variables, 3
Expressed interests, 146
Extreme groups, 33
F (fascist) scale, 138, 421
Face validity, 56
Factor analysis, 17, 18, 24, 56
Factor weights, 87
Factorial invariance, 165
Factorial validity, 73, 209
Faking, 154, 427
  impression management, 429
  self-deceptive enhancement, 430, 431
False negative, 60
False positive, 60
Family Adaptability and Cohesion Evaluation Scales, 507
Family Apperception Test, 509
Family functioning, 506
Family systems, 506
  test (FAST), 509
Fear Questionnaire (FQ), 501
Fear Survey Schedule, 499
Fear thermometer, 499
Federal Rules of Evidence (FRE), 422
Feedback, 11
Fiat, 22
Filler items, 22, 434
Five factor model, 88
  job performance, 363–364
Flexibility, 207
Fluency, 207
Fluid analytic abilities, 102
Fluid intelligence, 287
Focal assessment, 194
Folk concepts, 80
Forced choice items, 20, 434
Forced distribution method, 362
Forensic psychology, 419
Frostig Developmental Test of Visual Perception, 311
Fullard Toddler Temperament Scale, 230–231
g (general) factor, 96, 288
  occupational criteria, 359
Geist Picture Interest Inventory, 158–159, 302
Gender differences
  fear, 501
  SAT, 336
  TAT, 401
General Aptitude Test Battery (GATB), 305, 380
General Educational Development (GED) Tests, 332
General Health Survey, 415
Generalizability theory, 47, 64, 165–166, 463, 489
Georgia Court Competency Test, 420
  revised, 420
Geriatric Scale of Recent Life Events, 265
Gerontological Apperception Test, 270, 398
Gesell Developmental Schedules, 228, 527
Gibb Test of Testwiseness, 458
Gifted children, 245
GPA
  criterion, 341, 347
  GRE, 344
  medical school, 349
  restriction of range, 345
  SAT, 341
Grade norms, 35
Graduate Record Examination (GRE), 342
Grassi Basic Cognition Evaluation, 245, 246
Griggs vs. Duke Power, 423
Group vs. individual tests, 5
Guessing, 31
  correction, 31
Guidelines for Computer-based Tests and Interpretations, 476
Guttman scale, 133, 205, 260
Hall, G. S., 525
Halo, 251, 361
Halstead-Reitan
  computer and, 473
  Neuropsychological Battery, 390, 455
  Neuropsychological Test Battery for Children, 247
Hardiness, 410
  test, 410
Hassles and uplifts, 414
Hassles scale, 218
scale, 414
Hayes-Binet, 525
Health belief model, 411
Health psychology, 409
Health status, 414
Healy-Fernald tests, 525
Hearing impairment, 242, 312
Hierarchical theories, 97
Hiskey-Nebraska Tests of Learning Aptitude, 243
Hispanics. See Latinos
History, 517
  American applied orientation, 522
  British idiographic approach, 520
  French clinical tradition, 518
  German nomothetic approach, 519
Holland Self-Directed Search, 158
Holland's theory, 149, 153
Holtzman Inkblot Technique (HIT), 397
Home Observation for Measurement of the Environment Inventory (HOME), 502
Homogeneity, 45
  groups, 57
  items, 45
Hope, 216
House-Tree-Person, 320
How Supervise?, 382
Hysteria, 518
Illinois Tests of Psycholinguistic Abilities, 241
Imagery, 213
Impact of Life Event Scale, 411
Impression management, 508
Impressionistic interpretation, 7, 514
In-basket technique, 372
Index of discrimination, 32
Individual differences, 46
  Questionnaire (IDQ), 214
Individualized Classroom Environment Questionnaire (ICEQ), 504
Infant intelligence, 227
Infant Temperament Questionnaire, 230–231
Influences on tests, 3
  experimenter variables, 3
  method of administration, 3
  situational variables, 3
  subject variables, 3–4
Information about tests, 11
Informed consent, 1
Insanity, 420
Institute of Personality Assessment and Research (IPAR), 24, 209, 210
Instrumental values, 143
Integrity tests, 384, 452
Intelligence
  academic achievement, 97
  academic vs. practical, 97
  age scale vs. point scale, 98
  anthropological model, 95
  biological model, 94
  cognitive theory, 95
  computational model, 94
  developmental periods, 94
  epistemological model, 94
  fluid, 287
  geographic model, 94
  global-multiple, 93
  hierarchical theories, 97
  infant, 227
  job performance, 97
  metaphors, 94
  multiple intelligences, 95
  nature-nurture, 278
  Piaget's stages, 94
  quotient, 99, 531
  structure of intellect model, 95
  See also specific topics, tests
Intelligent testing, 98
Intercept bias, 277
Interest inventories
  disadvantaged, 158
  Kuder inventories, 157
  nonprofessional occupations, 159
  Strong Interest Inventory, 148
Interests, 149
  expressed, 157
  inventoried, 157
Interitem consistency, 46
Internal consistency, 33, 45, 56
International Classification of Diseases, 161
Interquartile range, 130
Interrater reliability, 490
Interscorer reliability, 212
Interval scales, 8
Interview, 485
Intimacy-permissiveness scale, 205
Intrascorer reliability, 212
Inventory of Psychosocial Balance (IPB), 84
Inventory of Test Anxiety, 457
Inwald Personality Inventory (IPI), 378
Iowa Tests of Basic Skills, 330
Ipsative measurement, 7
  Edwards Personal Preference Schedule, 76
  Q sets, 513
  Study of Values, 142
IQ. See Intelligence, quotient
Item
  analysis, 17
  banking, 474
  categories, 18
  difficulty, 28
  discrimination, 31
  homogeneity, 45
  sampling, 44
  selection, 99
  structure, 6
  types, 19
  writing, 18, 141
Item keying
  biodata, 366, 453
Item response distribution, 156
Item response theory (IRT), 34
Jenkins Activity Survey, 411
Job Components Inventory (JCI), 379
Job performance
  intelligence, 97
  personality, 363
  prediction, 359
Job success, 358
Jungian theory, 75
Kappa coefficient, 48, 490
Kaufman Assessment Battery for Children (K-ABC), 118
Keyed response, 18–19
Kuder General Interest Survey (KGIS), 157
Kuder Occupational Interest Survey (KOIS), 157
Kuder-Richardson Formula, 47
Kuder Vocational Preference Record (KVPR), 157
Kuhlmann-Binet, 525
Lake Wobegon effect, 330
Lambda scores, 158
Latinos
  GRE performance, 294, 347, 348
  SAT, 294, 337
Learning disabilities, 237
Learning Environment Inventory, 505
Learning Potential Assessment Device (LPAD), 95
Legal Attitudes Questionnaire (LAQ), 421, 422
Legal cases
  Albemarle Paper Co. v. Moody, 423
  Connecticut v. Teal, 423
  Daubert v. Merrell Dow Pharmaceuticals, 423
  Debra P. v. Turlington, 424
  DeFunis v. Odegaard, 423
  Diana v. State Board of Education, 425
  Gaines v. Monsanto, 425
  Griggs v. Duke Power Company, 423
  Guadalupe v. Tempe Elementary District, 425
  Larry P. v. Riles, 425
  Myart v. Motorola, 423
  Parents in Action on Special Education v. Hannon, 425–426
  Sharif v. NY State Education Department, 426
  Target Stores, 425
  Watson v. Fort Worth Bank and Trust, 423
Legal issues, 422, 477
Leisure Activities Blank, 512
Leiter International Performance Scale, 246
Liberalism-conservatism scale, 131
Licensure and certification, 352
Lie scales, 438, 450
Life events, 411
Life Experience Survey, 411
Life satisfaction, 261
Life Situation Questionnaire (LSQ), 322
Likert method, 131
Locator tests, 329
Locke-Wallace Marital Adjustment Test, 508
Locus of control, 202
  scale, 202, 203
London House Personnel Selection Inventory, 385
Loneliness, 218
Luria-Nebraska Children's Neuropsychological Test Battery, 247
MacArthur Competence Assessment Tool (MacCAT), 420
Marital satisfaction, 263
  Inventory, 510
Marital quality, 508
  Questionnaire for Older Persons (MSQFOP), 263
Marlowe-Crowne Social Desirability Scale, 446
Maslach Burnout Inventory, 434
Matching items, 20
Mattis Dementia Rating Scale, 266
Mayo clinic, 460
McAndrew Alcoholism Scale of the MMPI, 408
McCarney Attention Deficit Disorders Evaluation Scale, 237–238
McCarthy Scales of Children's Abilities, 254
McClenaghan and Gallahue checklist, 238
McGill Pain Questionnaire (PQ), 416
McMaster Family Assessment Device (FAD), 508
Medical College Admission Test (MCAT), 348
Memory, 267
Memory Assessment Clinics Self-Rating Scale (MAC-S), 269
Mental age, 99
Mental deficiency, 234, 524
Mental Measurements Yearbook, 5, 11–12
Mental Status Exam, 162, 163, 266
Meta-analysis, 57
Metropolitan Achievement Tests, 330
Mexican-Americans. See Latinos
Michigan Alcoholism Screening Test (MAST), 408
Military, 376
  Armed Services Vocational Aptitude Battery, 376
  Army Selection and Classification Project, 376
  Dot estimation task, 376
  Navy Basic Test Battery, 376
  visual spatial abilities, 377
Miller Hope Scale, 216
Mill Hill Vocabulary Scale, 288
Millon Adolescent Personality Inventory, 184
Millon Behavioral Health Inventory, 184
Millon Clinical Multiaxial Inventory (MCMI), 179, 451, 455
Mini-Mental State Examination, 266
Minimum competency testing, 304
Minnesota Multiphasic Personality Inventory (MMPI), 170
Minnesota Multiphasic Personality Inventory-Revised (MMPI-2), 170
Minnesota Preschool Inventory, 224
Minnesota Rate of Manipulation Test (MRMT), 308
Minorities
  cognitive abilities, 360
  MCAT scores, 351
  MMPI, 295
  norms on GATB, 382
  police, 378
  See also Cross-cultural testing; specific groups, tests
Moderator variables, 433
Modified testing, 300, 308
  hearing impaired, 314
Morale, 264
  scale, 264
Motor impairments, 238, 321
Multiaxial approach, 161
Multiple choice items, 18–19
  distractors, 19
  keyed responses, 18–19
  response options, 18, 21–22
  vs. essay, 334
Multiple factors, 96
Multiple regression, 39
Multitrait-multimethod matrix, 55, 56, 360
Multivariate tests, 18
Muscular dystrophy, 238
Myers-Briggs Type Indicator (MBTI), 74, 211
National Assessment of Educational Progress (NAEP), 333
Nativism, 311
Naturalistic observation, 485
Nature/nurture, 93, 528
Navy Basic Test Battery, 376
Nedelsky method, 354
NEO Personality Inventory-Revised (NEO-PI-R), 89
Neurobehavioral Cognitive Status Examination, 266
Neuropsychological assessment, 390
  children, 245, 246
  computerized, 472
  elderly, 266
  faking
  Halstead-Reitan Neuropsychological battery, 390
NOIR system, 8
  interval scales, 8
  nominal scales, 8
  ordinal scales, 8
  ratio scales, 9
Nominal scales, 8
Nomothetic approach, 64, 519
Norm reference, 7
Norms, 17, 34, 304
  age scale, 34
  convenience samples, 34
  local, 37
  random sampling, 34
  school grade, 35
  selection, 34
  special, 308
  stratified samples, 34
Northwestern Syntax Screening Test, 241
Objective-subjective continuum, 21
Observer reports, 129
Occupational choice, 375
Occupational testing, 356
O'Connor Tweezer Dexterity Test, 357
Option keying, 368
  biodata, 453
Ordinal scales, 8
Originality, 207
Ostomy Adjustment Scale, 322
Otis-Lennon School Ability Test (OLSAT), 123
Pain, 416
Paired comparison, 362
Paper-and-pencil tests, 6
  equivalence with computer forms
Parsons Visual Acuity Test, 245
Pattern matching, 55
Peabody Picture Vocabulary Test-Revised (PPVT-R), 238, 239, 317
Pearson coefficient, 43
Penile Tumescence Measurement (PTM), 494
Percentage of agreement, 397
Percentage overlap, 153
Percentiles, 25
Perceptual-motor assessment, 311
Performance assessment, 22, 258
Performance tests, 6
Perkins-Binet Test of Intelligence for the Blind, 245, 309
Personal Rigidity Scale, 23
Personality, 67
  content vs. style, 430
  definition, 67
  five factor model, 88
  Inventory for Children (PIC), 231
  Research Form (PRF), 78
  stylistic scales, 444
  theoretical models, 68
  traits, 69
  types, 67, 70
Personality test construction
  content validity, 71
  contrasted groups, 70
  criterion keying, 70
  deductive method, 70
  external method, 70
  fiat method, 71
  inductive method, 70
  rational method, 71
  stylistic scales, 431
  theoretical approach, 70
Personnel selection, 357
Philadelphia Geriatric Center Morale Scale, 264
Phobias, 499
Physical-motor disabilities, 321
Piaget theory, 94
Pictorial Inventory of Careers, 302
Pictorial Test of Intelligence, 238
Picture interest inventories, 302
Piers-Harris Self-Concept Scale, 226
Pilot testing, 16–17
Pintner-Patterson Performance Scale, 525
Pleasant Event Schedule (PES), 495
Police, 377
Polygraph, 420
Positional response bias, 435, 436
Postwar period, 526
Power tests, 6
Practical intelligence, 97
Predictive value, 58, 60
Predictors
  college achievement, 339
Preemployment testing, 356
Preschool Screening System, 224
Preschool testing, 325
  approaches, 326
Pretest, 17
Primary factors, 310
Primary validity, 65, 200, 201
Principal components analysis, 112
Privacy, 11
Profile stability, 78
Program evaluation, 2–3, 501
Project A, 376
Projective techniques, 69, 326, 392
  family assessment, 509
  sentence completion tests, 401
Proprietary tests, 5
Psychological Screening Inventory (PSI), 166
Psychophysical methods, 132
Public Law 88-352, 306
Public Law 94-142, 223, 307, 424
Public Law 99-457, 223
Purdue Hand Precision Test, 357
Q index of interquartile range, 130
Q index of overlap, 156
Q methodology, 513
Q sets, 513
  California, 514
  self-concept, 515
  stages of Kübler-Ross, 515
Q sort, 513
Racial distance quotient, 133
Racial norming, 281, 305
Random responding, 431
Randt Memory Test, 267
Rank-order correlation coefficient, 146
Rapport, 25, 258
Rater reliability, 48
Rathus Assertiveness Schedule, 497
Rating errors, 361
  bias, 361
  central tendency, 362
  corrections for, 361
  halo, 361
  leniency, 362
Rating scales, 139
  behavioral anchors, 362
  cultural differences, 363
  graphic, 139
  self vs. others, 361. See Self-rating scales
  supervisors, 360
Ratio IQ, 99
Ratio scales, 9
Raven's Progressive Matrices (PM), 288
Raw scores, 17
Reactivity, 490
Readability, 246
Reading-Free Vocational Interest Inventory-Revised, 302
Recentering, 335
Receptive and Expressive Emergent Language Scale, 241
Referral Decision Scale (RDS), 420
Regression, 39
  biodata, 370
  CPI, 440, 443
  equation for creativity, 211
  generalizability, 350
  SAT, 337
Rehabilitation Act of 1973, 306, 424
Reid Report, 385
Reitan-Indiana Test Battery, 247
Reliability, 42
  alternate form, 44
  correction for attenuation, 49
  Cronbach's alpha, 47
  difference scores, 51
  generalizability theory, 47
  interitem consistency, 46, 47
  interobserver, 48
  Kuder-Richardson formula, 47
  property of, 42
  rater, 48
  Rulon formula, 46
  scorer, 48
  Spearman-Brown formula, 45
  split-half, 44
  standard error of differences, 50
  standard error of measurement, 49
  test-retest, 43
  true vs. error, 42
Repeated administration, 364
Response range, 149
Response sets, 427
Response time, 479
Restriction of range, 150, 345
Retest reliability, 43
Reynell Developmental Language Scales, 241
Roberts Apperception Test for Children, 398
Rokeach Value Survey (RVS), 143
Role playing, 481, 485
Rorschach Inkblot Technique, 394, 526
Rothbart Infant Behavior Questionnaire, 230–231
Rotter Incomplete Sentences Blank, 402
Rotter's Interpersonal Trust Scale, 434
Rulon formula, 46
Sample size, 63
SAT, 334
  decline, 339
  hearing impaired, 317, 318
Scales for Rating the Behavioral Characteristics of Superior Students, 245, 246
Schedule for Affective Disorders and Schizophrenia (SADS), 270
Schedule of Recent Experience, 411
Schemas, 314
Schizotypal personality
  questionnaire (SPQ), 186
Scholastic Aptitude Test (SAT), 334
School and College Ability Tests III (SCAT III), 122
School testing, 325
Scientific inquiry
Score interpretation
  criterion reference, 7
  norm reference, 7
Scorer reliability, 48
Scores
  combining, 38
  delta, 30
  derived, 25
  raw, 17
  sources of error, 47
  standard, 26
  stanines, 28
  T, 28
  z, 26
Scottish surveys, 522
Screening tests, 7
Second order factors, 73
Secondary traits, 73
Secondary validity, 65, 191, 200, 201
Security, 5
Selection ratio, 62
Self actualization, 71, 410
Self-anchoring scales, 139
Self concept, 197, 226, 318
  hearing impaired, 318
Self-Consciousness Inventory (SCI), 86
Self-Esteem Inventory, 226
Self-Esteem Questionnaire (SEQ), 200
Self monitoring, 481, 485
Self-perception Inventory, 226
Self-Rating Depression Scale (SDS), 194
Self-rating scales
  anchor points, 69
  attitude measurement, 139
  behavioral assessment, 487, 489
  memory, 269
  personality, 69
Self-report measures, 68, 194, 411, 489
  social competence, 331
Self-understanding, 2
Semantic differential, 134, 321, 451
Sensitivity, 60, 100, 490, 499, 518
Sentence completion tests, 401
Sequential strategies, 62
Sesame Street, 502
Sexuality, 204, 494
  scale, 204
Short-Form General Health Survey (SF-36), 415
Short forms, 18
Short Portable Mental Status Questionnaire, 266
Sickness Impact Profile, 415
Situational methods, 69
Situational variables, 3
Slope bias, 277
Slosson Intelligence Test (SIT), 124
Snellen chart, 245, 312
Snyder Hope Scale (SHS), 217
Social desirability, 76, 444
  Edwards' scale, 444
  Personality Research Form, 78
  scales of, 78, 445, 446
Social emotional behavior, 230
Social Readjustment Rating Scale, 412
Social Sciences Citation Index, 12
Social skills, 496
Social validity, 489
Sociometric procedures, 129, 510
Spearman-Brown formula, 45
Specificity, 60, 100
Specimen set, 5
Speech impairments, 240
Speed of Thinking Test (STT), 125
Speed tests, 5
Spielberger Test Anxiety Inventory, 456
Spina bifida, 238
Spinal cord injury, 322
Spiral omnibus, 22
Split-half reliability, 44
Standard age scores, 103
Standard deviation, 26
Standard error of differences, 50
Standard error of estimate, 59
Standard error of measurement, 49
Standard scores, 26
Standardization, 1, 17
Standards for Educational and Psychological Testing, 10, 306
Stanford Achievement Tests, 330
  Special Edition, 317
Stanford-Binet, 101
  special forms, 105
Stanines, 28
Stanton Survey, 385
State-Trait Anxiety Inventory (STAI), 187
Stems, 19, 20
Sternberg Triarchic Abilities Test, 96
Stoma, 322
Stress, 411
Strong Campbell Interest Inventory (SCII), 148, 149
Strong Interest Inventory (SII), 149
Strong Vocational Interest Blank (SVIB), 148–149
Structure of Intellect
  Learning Abilities Test (SOI-LA), 120
  model, 95, 120, 208, 530
Structured Clinical Interview for DSM-III (SCID), 163, 164
Study of Values (SoV), 142
Stylistic scales, 433
Subject variables, 3–4
Success criteria, 347, 348
Suinn-Lew Asian Self-Identity Acculturation Scale (SL-ASIA), 286
Summated ratings, 131
Supervisory Practices Test (SPT), 379
Suppressor variables, 431
Symptom Checklist 90R (SCL-90R), 163, 164
Symptom validity testing, 435, 436
Synthetic validity, 374
System of Multicultural Pluralistic Assessment (SOMPA), 291
Table of specifications, 16
Tactual forms, 310
Taxonomy, 53
Taylor & Russell tables, 64
Taylor Manifest Anxiety Scale, 456
Teacher rating scales, 330
  childhood behavior problems, 330
Tell-me-a-story (TEMAS), 293
Tennessee Self Concept Scale (TSCS), 197
Terman, Lewis M., 524
Terminal values, 143
Tertiary validity, 65, 202
Test
  administration, 3, 25
  affected by, 3
  assessment, 2
  battery, 2, 162, 163
  bias, 272
  categories, 5
  commercially published, 5
  construction, 15
  definition, 1
  equivalence with computer forms, 327
  experimental procedure, 3
  group, 5
  individual, 5
  information about, 11
  invasiveness, 6
  levels, 11
  maximal vs. typical performance, 8
  power, 5
  predicted behavior, 5
  proprietary, 5
  security, 6, 11
  short forms, 18
  sophistication, 336, 457
  specimen set, 5
  speed, 5
  users, 2
  See also specific topics, tests
Test anxiety, 456
  questionnaire, 456
  scale, 456
  scale for children, 457
Test functions, 7
  certification, 7
  classification, 2, 7
  diagnosis, 7
  placement, 7
  prediction, 7
  screening, 7
  selection, 7
Testing and Selection Order, 356–357
Test purposes, 2
  classification, 2
  program evaluation, 2
  scientific inquiry
  self-understanding, 2
Test-retest reliability, 43
Testing the limits, 247
Tests in Microfiche, 12
Tests of General Educational Development (GED), 332
Testwiseness, 457. See Test sophistication
Thematic Apperception Test, 398
Thurstone method, 129, 527
Time urgency, 374
Token Test, 241
Torrance Test of Creative Thinking (TTCT), 206
True-false items, 19
Truth-in-testing, 424
T scores, 28
  uniform, 173
Turnover, 375
Two-factor theory, 96
Type A behavior, 411
UCLA Loneliness Scale (ULS), 219
Uniform Guidelines on Employee Selection Procedures, 356–357
Unitary weights, 87, 155
Unpleasant Events Schedule, 496
U.S. Civil Service Commission, 305
U.S. Department of Labor, 305
U.S. Office of Personnel Management, 305
Utility vs. validity, 338
Validity, 52
  applications and, 65
  coefficients, 58
  concurrent, 54
  construct, 54
  content, 5
  convergent and discriminant validity, 56
  criterion, 54
  differential, 57, 351
  external criteria, 275
  face, 56
  factorial, 73, 209
  generalization and, 57, 64, 342, 350
  individual, 65
  internal criteria, 275
  multitrait-multimethod matrix, 55, 56
  predictive, 54
  primary, 65, 201
  secondary, 65, 191, 201
  social, 489
  synthetic, 374
  tertiary, 65, 202
  utility and, 338
Validity scales, 69, 437, 438
  CPI, 438
  MMPI, 437, 438, 450
  MMPI-2, 440, 443
Value Survey (Rokeach) (RVS), 143
Values, 141, 365
  terminal vs. instrumental, 143
Variability, 46
Variables
  experimenter variables, 3
  moderator, 433
  situational, 2
  subject, 5
  suppressor, 433
  See also specific tests
Variance, 43
Vector scales (on CPI), 81–82
Verbalizer-Visualizer Questionnaire (VVQ), 215
Verification score (on Kuder), 157
Vignettes, 21
Vineland Adaptive Behavior Scale, 234, 320
Vineland Social Maturity Scale, 234
Visual acuity, 245, 312
Visual impairment, 244, 307
Vividness of Visual Imagery Questionnaire (VVIQ), 214
Vocational Interest Inventory, 158
Vocational Interest Survey, 158
Vocational placement, 298
Vocational rehabilitation, 297
Voir dire, 421
Vulpe Assessment Battery, 238
Washington University Sentence Completion Test, 402
Water Rating Scale, 505
Watson-Glaser Critical Thinking Appraisal, 357
Wechsler intelligence tests, 529
  Adult Intelligence Scale-Revised (WAIS-R), 105
  Adult Intelligence Scale (WAIS), 105
  Deterioration Quotient, 108
  hearing impaired, 314, 316
  Intelligence Scale for Children (WISC), 110
  Intelligence Scale for Children III (WISC-III), 110
  Intelligence Scale for Children-Revised (WISC-R), 110
  pattern analysis, 108, 112
  Preschool and Primary Scale of Intelligence (WPPSI), 114
  Preschool and Primary Scale of Intelligence-Revised (WPPSI-R), 114, 115
Wechsler-Bellevue Intelligence Scale, 105
Wechsler Memory Scale (WMS), 268
Wechsler Memory Scale-Revised (WMS-R), 268
Weighting
  differential, 38, 87
  unit, 38, 155
Wide Range Interest Opinion Test, 158–159, 302
Wisconsin Personality Disorders Inventory (WISPI), 185
Wonderlic Personnel Test (WPT), 383
Woodworth Personal Data Sheet, 526
Work Orientation scale of the CPI (WO), 383
Work samples, 373
Worry-Emotionality Questionnaire, 456
Yerkes Point Scale, 525
z scores, 26