Portions of this research were presented at the 20th Annual Conference of the Society
for Industrial and Organizational Psychology, Inc., Los Angeles, CA. The authors thank
Alyssa Gibbons, Amanda Farthing, Sang Woo, Carra Sims, and Myungjoon Kim for their
assistance with this project. Bradley J. Brummel is now at the University of Tulsa.
Correspondence and requests for reprints should be addressed to Deborah E. Rupp,
Department of Psychology, 603 E. Daniel St., University of Illinois at Urbana-Champaign,
Champaign, IL, 61820; derupp@uiuc.edu.
judgment tests, and in-basket exercises (Clause, Mullins, Nee, Pulakos, &
Schmitt, 1998; Lievens & Anseel, 2008; Lievens & Sackett, 2007; Oswald,
Friede, Schmitt, Kim, & Ramsey, 2005), there is currently no formal
protocol for establishing parallelism among behavioral assessments (e.g.,
those scored by assessors, such as simulation exercises in assessment
centers). This paper makes an initial attempt to apply and extend standard
test development procedures into this domain.
In the sections that follow, we more fully describe the behavioral
assessment context. Then, we review the test equating literature as it ap-
plies to traditional test formats and highlight aspects of behavioral assess-
ment for which the traditional protocol cannot be applied. Given these
gaps, we propose extensions of established procedures for these con-
texts. Finally, using data from an operational assessment center, we apply
these guidelines and evaluate the effectiveness of the process. We con-
clude with a discussion of the limitations of our application and provide
suggestions for researchers and practitioners attempting to construct and
equate dimension-based behavioral simulations in the future.
TABLE 1
Dimensions Assessed

Dimension                   Components
Problem solving             Problem understanding; thinking solutions through; decisiveness
Oral communication          Verbal/nonverbal expression; message clarity; appropriate communication style
Leadership                  Guidance of others; balance of needs; personal effectiveness
Conflict management         Effective strategies; handling conflict; constructive solutions
Information seeking         Use of multiple sources; situational relevance; creation of usable patterns
Planning and organizing     Goal setting; allocation of time and resources; monitoring and conducting planned activities
Fairness                    Interpersonal sensitivity; appropriate outcomes; executing processes
Cultural adaptability       Understanding cultural differences; culturally sensitive judgments; culturally appropriate communication
Simulation exercises are used for both selection and development, and
we argue that parallel simulations are potentially desirable in both con-
texts. Parallel versions of simulation exercises may be useful for adjusting
content to different settings or updating simulations with changes in tech-
nology or language. It has also been suggested that constructing alternate
forms of simulation exercises may be more cost effective than constructing
new simulation exercises from scratch (Lievens & Anseel, 2008). How-
ever, the most pressing reason for creating alternate forms of simulation
exercises (as well as for tests of any kind) is the possible influence of
preknowledge on validity. Concerns regarding content compromise and
preknowledge have been documented in the literature (Lievens & Anseel,
2008; Lievens & Sackett, 2007), and evidence has been presented showing
how coaching can affect simulation scores (Brannick, Michaels, & Baker,
1989; Brostoff & Meyer, 1984; Moses & Ritchie, 1976; Petty, 1974).
However, we are unaware of any specific research that directly assesses
the effect of simulation preknowledge on validity. At this point it remains
a theoretical possibility.
In some ways, preknowledge of the specific content making up simula-
tion exercises could affect an individual’s performance in a similar manner
to what has been described for other testing formats (cf. McLeod, Lewis,
& Thissen, 2003). If an individual’s scores on the dimensions assessed in
a simulation are artificially inflated due to specific knowledge of, or ex-
perience with, the content of the simulation exercise, then the scores will
not accurately reflect the level of proficiency for that individual on those
dimensions. In such a situation, the validity of inferences made from these
scores would be questionable. However, there are also some aspects of
simulation exercise preknowledge that are unique to this method. That is,
with regard to traditional paper-and-pencil-based tests, if test takers gain
access to items, they can determine the answers and memorize both the
items and correct responses prior to a second test administration (if paral-
lel versions are not used), or share this information with future test takers
(Cizek, 1999). The case is somewhat different in assessment center con-
texts. In assessment center programs, it is common to inform participants
ahead of time of the types of exercises that will be used and the dimensions
that will be assessed (indeed, this is common in traditional high-stakes
testing as well). What assessment center participants do not have access
to are the stimulus materials making up the exercises and the scoring
rubrics used to rate them on the dimensions. Stimulus material includes
the actual information that is provided to the participants at the start of
the exercises, which depends completely on the situation being simulated.
It might include memos and letters, personnel files, descriptions of the
situation that the participant must deal with, and so on. Scoring rubrics
include the BARS or behavioral checklists that define what constitutes
proficiency on the dimensions in the particular situation being simulated.
For selection/promotion. In a selection/promotion context, if the stimulus materials are compromised, participants would have the opportunity to practice prior to taking part in the simulation exercises, thereby inflating their scores and reducing the criterion-related validity of the assessment. Thus, as the number of assessment centers implemented in
organizations continues to increase worldwide, issues of test security
become increasingly important (Krause & Thornton, 2007; Kudisch et al.,
2001; Spychalski, Quiñones, Gaugler, & Pohley, 1997). Because they have
been established as a method producing scores with relatively substantial
predictive validity, while at the same time minimizing adverse impact
(Goldstein, Yusko, & Nicolopoulos, 2001; Huck & Bray, 1976; also see
Thornton & Rupp, 2006, for a review of this literature), assessment centers
are frequently used for selection in the most litigious of settings (e.g., po-
lice and fire departments, Coulton & Feild, 1995; Lowry, 1996). This use
has put AC users on an even higher level of alert to ensure that simulation
content and scoring procedures remain secure and uncompromised over
time. It is also important to note that if selection programs are using more
than one version of a simulation exercise, different degrees of difficulty
in any aspect of the versions introduce potential litigable unfairness into
the selection procedure.
For development. In a developmental context (e.g., Tillema, 1998), information about the scoring rubrics could be inferred from feedback (which often includes examples of effective and ineffective behavior in relation to the specific simulation exercises). This could allow participants to show dimension improvement when reassessed with the same exercises, improvement that might not generalize to broader workplace settings and that would therefore compromise the ability of the developmental assessment center to catalyze dimension-level improvement across situations. One developmental assessment center design, advocated by
Thornton and Rupp (2006), involves having participants first take part
in a preliminary set of simulation exercises that set a baseline for their
proficiency on the dimensions. Following their participation in the first
set of exercises, participants receive detailed feedback and set goals for
improvement. Following this, they take part in a second set of simu-
lation exercises, after which they receive feedback on their short-term
improvement and set more long-term goals. In this context, the criterion
is improvement on dimension proficiency from Time 1 to Time 2. With-
out having established that simulations are of equal difficulty, assessment
center researchers and practitioners have no way of knowing if measured
improvement is due to learning and development or if score change is due to the Time 2 simulations being easier than, or too similar to, the Time 1 simulations. Likewise, if participants complete identical simulations at
reassessment, then it would not be possible to untangle whether improved
performance is the result of familiarity with the specific exercise content
and scoring procedures or improved dimension proficiency. Establishing parallel simulations allows investigators to attribute dimension improvement to increased dimension proficiency rather than to exercise mastery.
In summary, whenever there is large-scale usage of simulation exercises for evaluation or development, it becomes potentially desirable to create alternate forms of simulation exercises, and with this practice comes the need to establish that the alternate forms are indeed parallel.
Parallel Tests
Parallel tests are defined as measures of the same construct that have
equal true scores, standard deviations, item intercorrelations, and simi-
lar factor structures (Cronbach, 1947). Parallel tests are rarely, if ever,
achieved in practice given this definition, and a thorough test of the par-
allelism of any set of measures will almost always support the conclusion
that the tests are not parallel (Traub, 1994). Therefore, professional test
developers use procedures to construct measures that approximate this
definition of parallelism. There are three categories that represent relax-
ations of the assumptions of strict parallelism. Tau-equivalent tests have
equal true scores but unequal error variances. Essentially tau-equivalent
tests have true scores defined by the formula $\mu_1 = \mu_2 + a$, such that the true score on test 1 is simply the true score on test 2 plus some constant. Essentially tau-equivalent tests have equal variances. Congeneric tests are said to be of the same kind, with true scores defined by the formula $\mu_1 = a\mu_2 + b$. This means that true scores on test 1 are a linear transformation of
true scores on test 2. Congeneric tests have unequal variances. Therefore,
parallel tests have the strictest assumptions, followed by tau-equivalent
tests, essentially tau-equivalent tests, and finally congeneric tests, which
have the most relaxed assumptions.
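For reference, the four models can be summarized compactly in the notation used above (a restatement of the definitions just given; the error-variance conditions follow the descriptions in the text):

```latex
\begin{align*}
\text{Parallel:}                   &\quad \mu_1 = \mu_2,     &\sigma^2_{E_1} &= \sigma^2_{E_2} \\
\text{Tau-equivalent:}             &\quad \mu_1 = \mu_2,     &\sigma^2_{E_1} &\neq \sigma^2_{E_2} \\
\text{Essentially tau-equivalent:} &\quad \mu_1 = \mu_2 + a  & & \\
\text{Congeneric:}                 &\quad \mu_1 = a\mu_2 + b & &
\end{align*}
```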
Constructing parallel measures. Multiple strategies for constructing
parallel measures have been developed, each of which makes certain
assumptions about the nature of the testing situation. Thus, the choice of
which strategy to use depends on the item types, the number of available
items, and the dimensionality of the test.
One strategy is random domain sampling. This strategy requires con-
structing a large pool of items that reflect the construct being tested. The
items are then randomly selected from this item pool for each parallel
version of the test. This method gives each item an equal chance of being
selected for each test version (Nunnally & Bernstein, 1994). If the test ver-
sions contain enough items and the items in the pool are similarly difficult,
then, except for sampling error, this process will construct approximately
parallel tests. Random domain sampling is an effective strategy unless the
test is multidimensional, in which case the method requires separate item
pools for each distinct dimension (Clause et al., 1998). Without separate
item pools, some dimensions of the construct could be underrepresented
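As an illustration of this strategy, the sketch below samples items from separate dimension-specific pools to assemble alternate forms. The pools, item names, and form length are hypothetical, and real pools would contain items of calibrated, comparable difficulty.

```python
import random

def assemble_forms(item_pools, items_per_dimension, n_forms, seed=42):
    """Randomly sample items from dimension-specific pools to build alternate forms.

    item_pools: dict mapping each dimension to its list of candidate items.
    Each form receives its own random draw, so every item in a pool has an
    equal chance of appearing on every form (random domain sampling).
    """
    rng = random.Random(seed)
    forms = []
    for _ in range(n_forms):
        form = {}
        for dimension, pool in item_pools.items():
            form[dimension] = rng.sample(pool, items_per_dimension)
        forms.append(form)
    return forms

# Hypothetical pools; separate pools per dimension keep each dimension represented on every form.
pools = {
    "problem_solving": [f"ps_item_{i}" for i in range(40)],
    "oral_communication": [f"oc_item_{i}" for i in range(40)],
}
form_a, form_b = assemble_forms(pools, items_per_dimension=10, n_forms=2)
```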
144 PERSONNEL PSYCHOLOGY
criteria to determine if the forms have similar true scores, standard de-
viations, item intercorrelations, and factor structures. This is done by
examining the equivalence of the sample statistics beginning with the
means, then the variances, and finally the covariance structures. If the
test means are too different to assume that the tests are parallel or
tau-equivalent, it may be possible to assume that the test forms con-
form to the more relaxed standards for essentially tau-equivalent or con-
generic tests (Traub, 1994). The means and standard deviations of al-
ternate test forms are compared using t-tests and F-tests. If means and
standard deviations are not found to be significantly different, then the
alternate forms can be assumed to be parallel or tau-equivalent. If this
is the case, scores on alternate forms can be directly compared. If sig-
nificant differences are found, the tests are considered essentially tau-
equivalent or congeneric, and a process of test equating can be used to
adjust test scores, making score comparison across alternate forms more
accurate.
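A minimal sketch of this comparison for a single dimension's scores on two forms is shown below (using SciPy; the variance-ratio F test is one common choice rather than the specific procedure any given program might use, and the score vectors are placeholders for observed ratings):

```python
import numpy as np
from scipy import stats

def compare_forms(scores_form1, scores_form2, alpha=0.05):
    """Compare one dimension's ratings from two alternate forms.

    Uses an independent-samples t-test for equality of means and a
    variance-ratio F test for equality of variances. Nonsignificant
    results are consistent with (though do not prove) parallel or
    tau-equivalent forms.
    """
    x1 = np.asarray(scores_form1, dtype=float)
    x2 = np.asarray(scores_form2, dtype=float)

    t_res = stats.ttest_ind(x1, x2)

    # Variance-ratio F test: larger sample variance in the numerator, two-tailed p.
    v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
    if v1 >= v2:
        f_stat, df_num, df_den = v1 / v2, len(x1) - 1, len(x2) - 1
    else:
        f_stat, df_num, df_den = v2 / v1, len(x2) - 1, len(x1) - 1
    p_f = min(2 * stats.f.sf(f_stat, df_num, df_den), 1.0)

    return {
        "mean_difference": x1.mean() - x2.mean(),
        "t": t_res.statistic, "p_means": t_res.pvalue,
        "F": f_stat, "p_variances": p_f,
        "means_differ": t_res.pvalue < alpha,
        "variances_differ": p_f < alpha,
    }
```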
Test equating. Although a number of methods have been proposed
to allow for score comparison across nonequivalent tests (e.g., calibra-
tion, concordance, and projection are other methods; see Angoff, 1971;
Kolen, 2004a), test equating emerges as the most reasonable method for
evaluating the parallelism of alternate test forms in selection and devel-
opment contexts in that in these applications, alternate tests are designed
to measure the same constructs and to make identical inferences in equiv-
alent populations in the same measurement conditions. Test equating is
used to link scores on alternate forms of a test in which the alternate
forms are built to the same content and statistical specifications (Angoff,
1971). The goal of test equating is to produce scores on a measure of
the same construct in the same metric that can be used interchangeably
(Dorans, 2004a). Test equating functions scale the scores on one or more
test forms so that the alternate forms of the test have equivalent means and
distributions. The method assumes that all forms of the tests are reliable
and that the equating functions are population invariant (Angoff, 1971).1
Specific information about calculating equating functions is provided
below.
1. Testing population invariance requires large samples of participants from a variety of
subpopulations for each test version. In many cases, it is not possible to effectively test for
population invariance due to the size of the samples in the testing situation (Dorans, 2004b).
However, it has been rare to find variance in equating equations across subpopulations (i.e.,
genders, races) that exceeds sampling error (Kolen, 2004b). Although our intention is not to minimize the possibility of population dependence in equating relationships, there are reasons to be fairly confident that the equating relationship for alternate forms of any test of the same construct should be approximately population invariant.
The question now becomes, how might we apply the standard tech-
niques for constructing parallel items and test equating to behavioral
assessment? Because behavioral simulation exercises measure clearly
defined dimensions, scored on consistent metrics, there are aspects of
these techniques that can be directly applied. However, there are other
characteristics that are germane to behavioral assessment (International
Taskforce, 2000; Thornton & Mueller-Hanson, 2004) that require special
consideration and possibly creative solutions in one’s attempt to estab-
lish parallelism. These complexities include the “shifting” composition
of dimensions assessed across simulation exercises within an assessment
center, the scoring of observed behaviors by assessors, the lack of specif-
ically defined test items, and differing units of analysis depending on the
purpose of the assessment.
Assessment of dimensions. As mentioned above, most behavioral
simulation exercises measure multiple dimensions. That is, it is typical
for 3–6 dimensions to be assessed in a single exercise (Thornton & Rupp,
2006). In fact, this is one of the stated advantages of simulation exer-
cises over other assessment formats (Thornton & Mueller-Hanson, 2004).
However, in assessment centers, although it is necessary to assess each dimension in more than one exercise (International Taskforce, 2000), it is common not to assess every dimension in every exercise. This fea-
ture is designed to lessen the cognitive load on assessors and to limit the
assessment of dimensions to those that are most appropriately assessed by
a particular exercise. This results in a collection of simulation exercises, each of which measures a unique combination of dimensions. This design
characteristic adds complexity to the construction of parallel simulation
exercises, because in order to establish parallelism, each alternate form
needs to assess the same set of dimensions.
Therefore, the simulation exercises must be constructed such that the
content expected to elicit behaviors relevant to each assessed dimension
is specified. Alternate versions must match this dimension-level content.
Consequently, when comparing scores across alternate simulation exer-
cises, it is necessary to compare (and equate, if deemed necessary) scores
for each dimension assessed in the simulation exercise. What this also
implies is that in order to create parallel assessment centers, it is neces-
sary to have exercises that are matched in terms of format, content, and
dimensions assessed, and the within-exercise dimension scores must be
compared (and equated, if necessary) within matched sets of exercises.
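One simple way to represent and verify this constraint is a design matrix mapping each exercise to the dimensions it assesses. The sketch below uses hypothetical exercise names and dimension assignments rather than the actual design of any particular program:

```python
# Dimensions assessed by each exercise in the original form (illustrative only).
ORIGINAL = {
    "case_and_presentation": {"information_seeking", "problem_solving", "oral_communication"},
    "interview_simulation": {"leadership", "conflict_management", "fairness"},
}

# The proposed alternate form must cover exactly the same dimensions per matched exercise.
ALTERNATE = {
    "case_and_presentation": {"information_seeking", "problem_solving", "oral_communication"},
    "interview_simulation": {"leadership", "conflict_management", "fairness"},
}

def check_dimension_match(original, alternate):
    """Flag any matched exercise whose alternate form assesses a different dimension set."""
    mismatches = {}
    for exercise, dims in original.items():
        alt_dims = alternate.get(exercise, set())
        if alt_dims != dims:
            mismatches[exercise] = {"missing": dims - alt_dims, "extra": alt_dims - dims}
    return mismatches

assert check_dimension_match(ORIGINAL, ALTERNATE) == {}
```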
Observed behaviors. Second, within each simulation exercise, the
scores on the dimensions are not provided directly by the participants as
in most testing formats. Rather, participants respond to the stimuli with
BRADLEY J. BRUMMEL ET AL. 147
behaviors that are then observed, recorded, classified into dimensions, and
rated by assessors (International Taskforce, 2000). Although this format introduces assessor variance both within simulations and between the alternate forms, an advantage to this design characteristic is that
a consistent rating scale can be used across dimensions, exercises, and
participants (Rupp, Gibbons, et al., 2006). Because the participant does
not directly interact with the scoring mechanism in a simulation exercise,
these mechanisms (e.g., BARS) can remain consistent in the alternate
versions. This increases the degree to which the simulation exercises are
constructed to the same content specifications.
“Test” items. The establishment of parallel tests, as they are tradi-
tionally conceived, also requires matching item difficulty across alternate
forms. This is not as straightforward in the construction of parallel simu-
lation exercises, where it is much less clear what constitutes a test “item.”
In these contexts, it is necessary to match the difficulty of eliciting be-
haviors associated with a certain level of proficiency on each dimension
across the alternate simulation exercises. As mentioned earlier, alternate
test forms can be constructed by changing the surface features of the test
items, while keeping the deep structure the same. However, distinguish-
ing what constitutes surface and deep structure in a simulation exercise
is difficult because the stimuli are complex. Although psychometrically
the dimensions act as items, it is difficult to identify which aspects of the simulation content map onto which dimensions, and even harder to determine the relative difficulty the stimulus materials create with regard to
each dimension. In other words, ascertaining which structural elements
will actually affect the type and level of behaviors that a participant may
display is difficult in a simulation exercise.
This complexity also raises issues for measuring the reliability of a
simulation exercise. Due to the time constraints and effort involved in con-
ducting simulation exercises, test–retest reliability is not often calculated.
An alpha-type lower-bound reliability estimate also cannot be computed
if there is only one rating provided for each dimension in the simulation
exercise. Assessors are calibrated to observe and rate behaviors reliably,
but there is no easy measure of the reliability of dimension scores. Thus,
for procedures like test equating, sufficient reliability of measurement
must be assumed.
Various units of analysis. A final unique aspect inherent to behavioral
assessment is that the unit of analysis, that is, what the test administrator
considers the “scores” for the purpose of equating across alternate forms,
differs depending on the purpose and design of the assessment program.
Although we are focusing in this paper on simulation exercises that have
dimensions as their focal construct, depending on the structure of the
program, the manner in which dimension ratings are aggregated can differ.
TABLE 2
Guidelines for Constructing Alternate Simulation Exercises
$\mu_1 = \mu_2 + a$

$\mu_1 = (\delta_1 / \delta_2)\,\mu_2 + a$
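These two relationships appear to describe a constant shift and a shift plus rescaling by the ratio of standard deviations, that is, mean equating (suited to essentially tau-equivalent forms) and linear equating (suited to congeneric forms). As a sketch of how such functions could be computed and applied (standard mean- and linear-equating formulas; the numeric values are taken from the across-exercise conflict management ratings reported later in Table 4, purely for illustration):

```python
def mean_equating(mu_ref, mu_alt):
    """Return a function that shifts alternate-form scores onto the reference-form scale.

    Appropriate when the forms are treated as essentially tau-equivalent
    (scores differ only by an additive constant).
    """
    shift = mu_ref - mu_alt
    return lambda score: score + shift

def linear_equating(mu_ref, sd_ref, mu_alt, sd_alt):
    """Return a function that rescales and shifts alternate-form scores onto the
    reference-form scale (congeneric case: means and standard deviations may differ)."""
    slope = sd_ref / sd_alt
    return lambda score: slope * (score - mu_alt) + mu_ref

# Illustrative values from the across-exercise conflict management ratings (Table 4).
equate = linear_equating(mu_ref=4.85, sd_ref=0.83, mu_alt=4.11, sd_alt=0.90)
adjusted_score = equate(4.5)  # an alternate-form rating expressed on the original form's scale
```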
Methods
Participants
Procedures
(I-O psychology faculty and doctoral students) who developed and pi-
lot tested the assessment center. An original simulation exercise and its pair (designed to be parallel) used exactly the same BARS.
The developmental assessment center program was designed to func-
tion with six participants and six assessors. Assessees and assessors were
rotated such that every participant was assessed by more than one assessor,
and each assessor assessed more than one type of exercise. Participants
received detailed, developmental feedback from an assessor who was assigned to them following their participation. In the feedback sessions,
each dimension was discussed in turn. Examples of effective and inef-
fective behavior elicited by the participant in the exercises were provided
for each dimension. The feedback session concluded with the formation
of developmental goals and action steps aimed at improving proficiency
on each of the dimensions. Participating in the full set of exercises and
receiving feedback took an entire 8-hour day.
with role players in whatever style they preferred in both situations: one
at a time, together, in a specific order, or multiple times. The role players
were given two-page profiles of their characters. These characters had sim-
ilar personalities and workplace disputes from their former workplaces:
One was more individualistic and hierarchy oriented whereas the other
one was more collaborative and team oriented. After the detailed specifi-
cations had been agreed upon, the construction of the alternate forms of
these simulations commenced.
Construction of parallel simulation exercises. The detailed simula-
tion specifications were then used to construct an alternate form for each
of the original simulation exercises. The subject matter experts’ experi-
ence with the original exercises provided a working knowledge of the
difficulty of the original exercises—which we found to be essential to this
process. When it was unclear how to apply the detailed specifications to
the alternate form, the original exercise was reexamined for clarification.
Once alternate forms for all six exercises were completed, pilot testing
was conducted to help evaluate how the new simulations would function
in an operational assessment center setting.
As mentioned earlier, the assessment center program was designed
such that a different organization was simulated in the morning and after-
noon. The morning exercises were cast within a luggage manufacturing
company, and the afternoon exercises were cast in two recently merged
computer chip manufacturing companies. As a result, in constructing the
parallel simulation exercises, we also had to design materials describ-
ing two additional organizations in which the morning and afternoon
exercises of our parallel assessment center would be set. The subject mat-
ter expert team decided on service-based organizations (as opposed to the manufacturing-based organizations that comprised the original assessment
center) as a means of altering aspects of the surface structure of the origi-
nal exercises (as recommended in the guidelines). Consequently, we cast
the alternate morning exercises in an advertising firm and the afternoon
exercises in two recently merged banks.
Developing the organizational contexts for the exercises began with
each subject matter expert reviewing the organizational contexts from
the original assessment center. They then took one detail of the or-
ganizational background at a time to clone in the new settings. Then,
both the detailed simulation specifications and these new organiza-
tional background materials were used to construct an alternate form
of each original simulation. In addition, the morning and afternoon ex-
ercises, respectively, were compared to each other to check for log-
ical coherence with the other exercises in that block. The full sub-
ject matter expert team met weekly to review progress and discuss the
exercises.
To pilot test the exercises, both the original and alternate simulation ex-
ercises were administered to a group of advanced undergraduate students
at a large midwestern university.2 All participants were video recorded.
Experts watched the videos to judge the extent to which they believed that the difficulty of the alternate forms was similar to that of the original simulation exercises in terms of eliciting dimension-relevant behaviors. Using
this information, additional adjustments were made to some of the ex-
ercises. Once every possible attempt was made to emulate the detailed
content specifications and difficulties of the original simulation exercises,
the alternate forms were ready to be administered to Sample 2.
Sample comparison. Sample 2 participated in the alternate forms of
the simulation exercises. Once they had completed the exercises, we were
then able to use the results from Sample 2 to evaluate the equivalence
of the alternate forms to the original exercises completed by Sample 1.
To test whether exercise pairs were parallel, we compared the means and
standard deviations of each dimension assessed in each exercise. Further,
to test whether the two assessment centers, as a whole, were parallel, we
compared the across-exercise dimension ratings for the two groups.
Results
2. It should be noted that this pretest population was not the intended audience for the
simulation exercises. However, we have no reason to believe that the lower level of skill
and experience in the pilot sample differentially affected their scores on the original and
alternate forms.
TABLE 3

Dimension (by exercise)        Sample 1           Sample 2          Equality of means              Equality of variances
                               N    M     SD      N    M     SD     Difference    t      p     d      F      p

Case and Presentation 1
Information seeking            98   4.63  1.49    85   4.55  1.21      .08       .40    .69   .06    2.48   .12
Problem solving                99   4.42  1.33    85   4.54  1.12     −.11      −.62    .54  −.09    2.56   .11
Oral communication             99   4.64  1.44    85   4.56  1.03      .08       .41    .68   .06   16.56   .00
Planning and organizing        97   4.30  1.43    85   4.40  1.25     −.10      −.50    .62  −.07    2.12   .15
Fairness                       99   4.32  1.21    85   4.32  1.12      .00       .01   1.00   .00     .26   .61

Interview Simulation 1
Information seeking           100   4.99  1.34    85   4.31  1.27      .68      3.49    .00   .52     .06   .81
Planning and organizing       100   4.71  1.30    85   3.61  1.15     1.09      5.95    .00   .89    1.68   .20
Leadership                    100   5.01  1.19    85   4.04  1.40      .97      5.05    .00   .76    2.57   .11
Conflict management           100   4.99  1.21    85   4.18  1.29      .81      4.33    .00   .65    1.06   .31
Fairness                      100   4.86  1.03    85   4.70  1.36      .16       .89    .38   .14    7.40   .01

Leaderless Group 1
Problem solving               100   4.90  1.19    85   4.18  1.00      .72      4.43    .00   .65    5.56   .02
Oral communication            100   4.69  1.14    85   4.11   .92      .59      3.79    .00   .57    2.94   .09
Leadership                    100   4.45  1.25    85   3.88  1.13      .57      3.22    .00   .48    2.28   .13
Conflict management           100   4.82  1.11    85   3.89  1.16      .94      5.57    .00   .83     .07   .79
Fairness                      100   5.33   .98    85   4.11  1.07     1.22      8.07    .00  1.19    1.38   .24

Case and Presentation 2
Information seeking           100   4.22  1.26    85   4.40   .92     −.18     −1.14    .26  −.17    8.80   .00
Problem solving               100   4.36  1.23    85   4.05  1.00      .31      1.89    .06   .28    5.67   .02
Oral communication            100   5.05  1.15    85   4.56  1.11      .50      2.97    .00   .44     .31   .58
Planning and organizing       100   4.37  1.41    85   4.31  1.14      .05       .28    .78   .04    9.09   .00
Cultural adaptability         100   4.11  1.51    85   3.97  1.25      .14       .67    .50   .10    6.04   .02

Interview Simulation 2
Information seeking           100   5.36  1.26    85   4.57  1.15      .79      4.40    .00   .65     .38   .54
Planning and organizing       100   4.56  1.34    85   4.40  1.21      .16       .85    .40   .13    2.18   .14
Leadership                    100   4.75  1.23    85   4.48  1.24      .26      1.44    .15   .21     .75   .39
Conflict management           100   4.79  1.25    85   4.31  1.43      .48      2.42    .02   .36    1.77   .19
Cultural adaptability         100   4.91  1.47    85   4.28  1.22      .63      3.15    .00   .46    5.15   .02

Leaderless Group 2
Problem solving               100   4.70  1.26    85   4.30  1.00      .40      2.36    .02   .35    1.87   .17
Oral communication            100   4.92  1.16    85   4.70   .90      .22      1.45    .15   .21    4.98   .03
Leadership                    100   4.58  1.30    85   4.28   .99      .29      1.73    .09   .25    4.38   .04
Conflict management           100   4.71  1.03    85   4.15   .98      .55      3.66    .00   .55     .00   .98
Cultural adaptability         100   4.47  1.39    85   4.21  1.06      .26      1.43    .16   .21    7.10   .01

Note. Difference, t, p, and d refer to the test of equality of means (Sample 1 minus Sample 2); F and its p refer to the test of equality of variances.
TABLE 4
Sample 1–Sample 2 Differences on Across-Exercise Dimension Ratings

Dimension                      Sample 1           Sample 2          Equality of means              Equality of variances
                               N    M     SD      N    M     SD     Difference    t      p     d      F      p
Information seeking           100   4.82   .88    85   4.43   .83      .39      3.09    .00   .46     .19   .66
Planning and organizing       100   4.52  1.02    85   4.18   .90      .34      2.39    .02   .35    2.04   .16
Problem solving               100   4.61   .89    85   4.26   .73      .35      2.94    .00   .43    4.69   .03
Oral communication            100   4.83   .91    85   4.48   .74      .35      2.89    .00   .42    7.05   .01
Conflict management           100   4.85   .83    85   4.11   .90      .74      5.79    .00   .85     .25   .61
Leadership                    100   4.72   .86    85   4.15   .89      .57      4.42    .00   .65     .01   .92
Fairness                      100   4.84   .72    85   4.38   .86      .46      3.95    .00   .58    2.32   .13
Cultural adaptability         100   4.50  1.15    85   4.14   .89      .35      2.34    .02   .34    6.50   .01

Note. Difference, t, p, and d refer to the test of equality of means (Sample 1 minus Sample 2); F and its p refer to the test of equality of variances.
than did the original version, although the size of the difference depended on the dimension assessed. We found the greatest discrepancies for the dimensions conflict management (d = .85) and leadership (d = .65).
Based on these results, we can conclude that neither the exercises nor the assessment center as a whole were parallel or tau-equivalent. Therefore,
it becomes necessary to compute equating functions so that the ratings of
future assessment center candidates participating in the alternate assess-
ment center can have their scores adjusted so that scores resulting from
alternate exercises (or across-exercise dimension ratings from alternate
assessment centers) are directly comparable. This particular issue was of
great importance for the program in which we collected our data. As we
explained above, this was a developmental program, where individuals
would take part in the original set of assessment center exercises, receive
extensive feedback, and work toward developmental goals over the course of a year. The program staff designed the parallel version of the assessment center in order to reassess participants 1 year later to determine their progress in improving their proficiency on the dimensions. The program administrators felt it would be problematic to put participants through the same set
of exercises, in that the participants were provided with detailed feedback
on how they could have more effectively completed the exercises, and the
center was designed to assess global proficiency on the dimensions rather
than task performance surrounding a particular simulated situation. How-
ever, the use of the alternate assessment center stood to be problematic as
well without information on the relative difficulty of the two sets of sim-
ulation exercises. Indeed, if the exercises used for the reassessment were
easier than the Time 1 exercises, then improvement could be inferred when
it was in fact not there. If the exercises were more difficult, then evidence
for true learning and improvement would be masked. Consequently, the
process of comparing the sample statistics from the two samples described
above allowed the program administrators to (a) realize that their alternate
exercises were not parallel and (b) use the data they had collected so far
to compute equating functions. By doing so, the program staff could put
participants through the alternate assessment center at Time 2, adjust their
scores using the equating function, and more accurately determine their
development on the dimensions over time.
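A sketch of how such an adjustment might look in practice is given below (hypothetical helper and parameter values; it assumes a per-dimension linear equating function of the form described earlier has already been estimated from Samples 1 and 2):

```python
# Hypothetical per-dimension equating parameters (slope = SD1/SD2, intercept chosen so
# that the Sample 2 mean maps onto the Sample 1 mean); the values are illustrative only.
EQUATING = {
    "conflict_management": {"slope": 0.83 / 0.90, "intercept": 4.85 - (0.83 / 0.90) * 4.11},
}

def equated_improvement(dimension, time1_score, time2_raw_score):
    """Convert a Time 2 rating from the alternate form onto the original form's scale,
    then compute dimension-level change that is not confounded with form difficulty."""
    params = EQUATING[dimension]
    time2_equated = params["slope"] * time2_raw_score + params["intercept"]
    return time2_equated - time1_score

change = equated_improvement("conflict_management", time1_score=4.2, time2_raw_score=4.4)
```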
To show the impact of this method, we point the readers to
Tables 5 and 6. In Table 5 we provide the equating functions that were
computed using the Sample 1 and Sample 2 across-exercise dimension
ratings. Note that our method for computing the equating function de-
pended on whether we determined the dimension ratings to be essentially tau-equivalent or congeneric.
TABLE 5
Equating Relationship for the Assessment Center Dimensions
TABLE 6
A Comparison of 22 Reassessed Participants Using Equated and Nonequated
Scores on the Alternate Forms
3. The demographic characteristics of this sample were virtually identical to the charac-
teristics of Samples 1 and 2, described in the Methods section.
Discussion
Test Equating
differences, the use of equated scores is still preferable to the use of raw
simulation scores.
4. We did, however, establish interrater agreement within the assessor training program. Given the extensiveness of our frame-of-reference training, we do not feel we were at risk of low interrater reliability (compared to the average AC program). However, it would have been useful for the purposes of this paper to run this calculation, given its importance to establishing parallelism.
5. We thank the editor for this suggested research design for a more thorough test of
parallelism.
Conclusion
REFERENCES
Angoff WH. (1971). Scales, norms, and equivalent scores. In Thorndike RL (Ed.), Educa-
tional measurement (2nd ed., pp. 508–600). Washington, DC: American Council on
Education.
Arthur W Jr., Day EA, McNelly TL, Edens PS. (2003). Effectiveness of training in or-
ganizations: A meta-analysis of design and evaluation features. Personnel Psychology, 56, 125–154.
Brannick MT, Michaels CE, Baker DP. (1989). Construct validity of in-basket scores.
Journal of Applied Psychology, 74, 957–963.
Brostoff M, Meyer HH. (1984). The effects of coaching on in-basket performance. Journal
of Assessment Center Technology, 7, 17–21.
Campion MA, Palmer DK, Campion JE. (1997). A review of structure in the selection
interview. Personnel Psychology, 50, 655–702.
Cizek GJ. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ:
Erlbaum.
Clause CS, Mullins ME, Nee MT, Pulakos E, Schmitt N. (1998). Parallel test form de-
velopment: A procedure for alternative predictors and an example. Personnel Psychology, 51, 193–208.
Connelly BS, Ones DS, Ramesh A, Goff M. (2008). A pragmatic view of assessment center
exercises and dimensions. Industrial and Organizational Psychology: Perspectives
on Science and Practice, 1, 121–124.
Coulton GF, Feild HS. (1995). Using assessment centers in selecting entry-level police
officers: Extravagance or justified expense? Public Personnel Management, 24(2),
223–254.
Cronbach LJ. (1947). Test “reliability”: Its meaning and determination. Psychometrika, 12,
1–16.
Dennis I, Handley SJ, Bradon P, Evans J, Newstead SE. (2002). Towards a predictive model
of the difficulty of analytical reasoning items. In Irvine SH, Kyllonen PC (Eds.),
Item generation and test development (pp. 53–71). Mahwah, NJ: Erlbaum.
Dorans NJ. (2004a). Equating, concordance, and expectation. Applied Psychological Mea-
surement, 28, 227–246.
Dorans NJ. (2004b). Editor’s introduction: Assessing population sensitivity of equating
functions. Journal of Educational Measurement, 41, 1–2.
Goldstein HW, Yusko KP, Nicolopoulos V. (2001). Exploring Black-White subgroup dif-
ferences of managerial competencies. Personnel Psychology, 54, 783–807.
Hardison CM, Sackett PR. (2004, April). Assessment center criterion-related validity: A
meta-analytic update. Poster presented at the 19th Annual Conference of the Society
for Industrial and Organizational Psychology, Chicago, IL.
Harris MM, Becker AS, Smith DE. (1993). Does the assessment center scoring method
affect the cross-situational consistency of ratings? Journal of Applied Psychology,
78, 359–378.
Howard A. (2008). Making assessment centers work the way they are supposed to. In-
dustrial and Organizational Psychology: Perspectives on Science and Practice, 1,
98–104.
Huck JR, Bray DW. (1976). Management assessment centers evaluations and subsequent
job performance of Black and White females. Personnel Psychology, 26, 13–30.
International Taskforce on Assessment Center Guidelines. (2000). Guidelines and ethical
considerations for assessment center operations. Public Personnel Management, 29,
315–331.
Irvine SH, Dann PL, Anderson JD. (1990). Towards a theory of algorithm-derived cognitive
test structure. British Journal of Psychology, 81, 173–195.
John OP, Srivastava S. (1999). The big-five trait taxonomy: History, measurement, and
theoretical perspectives. In Pervin L, John OP (Eds.), Handbook of personality:
Theory and Research (2nd ed., pp. 102–138). New York: Guilford.
Kolen ML. (2004a). Linking assessments: Concept and history. Applied Psychological
Measurement, 41, 219–226.
Kolen ML. (2004b). Population invariance in equating and linking: Concept and history.
Journal of Educational Measurement, 41, 3–14.
Kolk NJ, Born MP, Van der Flier H. (2002). Impact of common rater variance on construct
validity of assessment center dimension judgments. Human Performance, 15, 325–
338.
Krause DE, Thornton GC III. (2007, April). A comparison of assessment center practices in
Western Europe and North America. Poster presented at the 22nd Annual Conference
of the Society of Industrial Organizational Psychology, New York.
Kudisch JD, Avis JM, Fallon JD, Thibodeaux HF, Roberts FE, Rollier TJ, et al. (2001,
April). A survey of assessment center practices worldwide: Maximizing innovation
or business as usual? Paper presented at the 16th Annual Conference of the Society
for Industrial Organizational Psychology, San Diego, CA.
Lance C. (2008). Why traditional explanations for assessment center validity are wrong.
Industrial and Organizational Psychology: Perspectives on Science and Practice,
1, 84–97.
Lance CE, Lambert TA, Gewin AG, Lievens F, Conway JM. (2004). Revised es-
timates of dimension and exercise variance components in assessment center
post-exercise dimension ratings. Journal of Applied Psychology, 89(2), 377–
385.
Lievens F, Anseel F. (2008). Creating alternate in-basket forms through cloning: Some
preliminary results. International Journal of Selection and Assessment, 15, 428–
433.
Lievens F, Sackett PR. (2007). Situational judgment tests in high-stakes settings: Issues
and strategies with generating alternate forms. Journal of Applied Psychology, 92,
1043–1055.
Lowry PE. (1996). A survey of the assessment center process in the public sector. Public
Personnel Management, 25(3), 307–321.
McLeod L, Lewis C, Thissen D. (2003). A Bayesian method for the detection of item pre-
knowledge in computerized adaptive testing. Applied Psychological Measurement,
27(2), 121–137.
Moses JL, Ritchie RJ. (1976). Supervisory relationship training: A behavioral evalua-
tion of a behavioral modeling program. Personnel Psychology, 29, 337–343.
Nunnally JC, Bernstein IH. (1994). Psychometric theory (3rd ed.). New York: McGraw-
Hill.
O’Brien ML. (1989). Psychometric issues relevant to selecting items and assembling par-
allel forms of language proficiency instruments. Educational & Psychological Mea-
surement, 49, 347–353.
Oswald FL, Friede AJ, Schmitt N, Kim BK, Ramsey LJ. (2005). Extending a practical
method for developing alternate test forms using independent sets of items. Orga-
nizational Research Methods, 8, 149–164.
Petty MM. (1974). A multivariate analysis of the effects of experience and training upon
performance in a leaderless group discussion. Personnel Psychology, 27, 271–282.
Rottinghaus PJ, Betz NE, Borgen FH. (2003). Validity of parallel measures of vocational
interests and confidence. Journal of Career Assessment, 11, 355–378.
Rupp DE, Gibbons AM, Baldwin AM, Snyder LA, Spain SM, Woo SE, et al. (2006). An
initial validation of developmental assessment centers as accurate assessments and
effective training interventions. Psychologist-Manager Journal, 9, 171–200.
Rupp DE, Snyder LA, Gibbons AM, Thornton GC III. (2006). What should managerial
developmental assessment centers be developing? Psychologist-Manager Journal,
9, 75–98.
Rupp DE, Thornton GC, Gibbons AM. (2008). The construct validity of the assess-
ment center method and usefulness of dimensions as focal constructs. Industrial
and Organizational Psychology: Perspectives on Research and Practice, 1, 116–
120.
Sackett PR, Dreher GF. (1982). Constructs and assessment center dimensions: Some trou-
bling empirical findings. Journal of Applied Psychology, 67(4), 401–410.
Schleicher DJ, Day DV, Mayes BT, Riggio RE. (2002). A new frame for frame-of-reference
training: Enhancing the construct validity of assessment centers. Journal of Applied
Psychology, 87(4), 735–746.
Schmidt FL, Hunter JE. (1998). The validity and utility of selection methods in personnel
psychology: Practical and theoretical implications of 85 years of research findings.
Psychological Bulletin, 124, 262–274.
Schmidt FL, Rader M. (1999). Exploring the boundary conditions for interview validity:
Meta-analytic validity findings for a new interview type. Personnel Psychology, 52, 445–464.
Schneider JR, Schmitt N. (1992). An exercise design approach to understanding assessment
center dimension and exercise constructs. Journal of Applied Psychology, 77, 32–41.
Shavelson RJ, Webb NM. (1991). Generalizability theory: A primer. Thousand Oaks, CA:
Sage.
Smith PC, Kendall LM. (1963). Retranslation of expectations: An approach to the con-
struction of unambiguous anchors for rating scales. Journal of Applied Psychology,
47(2), 149–155.
Spychalski AC, Quiñones MA, Gaugler BB, Pohley K. (1997). A survey of assessment
center practices in organizations in the United States. Personnel Psychology, 50, 71–90.
Thornton GC III, Mueller-Hanson RA. (2004). Developing organizational simulations: A
guide for practitioners and students. Mahwah, NJ: Erlbaum.
Thornton GC III, Rupp DE. (2003). Simulations and assessment centers. In Thomas JC
(Ed.), Hersen M (Series Ed.), Comprehensive handbook of psychological assess-
ment, Vol. 4: Industrial and organizational assessment. New York: Wiley.
Thornton GC III, Rupp DE. (2006). Assessment centers in human resource management.
Mahwah, NJ: Erlbaum.
Tillema HH. (1998). Assessment of potential, from assessment centers to development
centers. International Journal of Selection and Assessment, 6(3), 185–191.
Traub RE. (1994). Reliability for the social sciences: Theory and applications (Volume 3).
Thousand Oaks, CA: Sage.
van der Linden WJ, Adema JJ. (1998). Simultaneous assembly of multiple test forms.
Journal of Educational Measurement, 35, 185–198.
Van Iddekinge CH, Raymark PH, Eidson CE Jr., Attenweiler WJ. (2004). What do structured
interviews really measure? The construct validity of behavior description interviews.
Human Performance, 17, 71–93.
Wonderlic & Associates, Inc. (1983). Wonderlic Personnel Test.
Zedeck S. (1986). A process analysis of the assessment center method. Research in Orga-
nizational Behavior, 8, 259–296.