
PERSONNEL PSYCHOLOGY

2009, 62, 137–170

CONSTRUCTING PARALLEL SIMULATION
EXERCISES FOR ASSESSMENT CENTERS AND
OTHER FORMS OF BEHAVIORAL ASSESSMENT
BRADLEY J. BRUMMEL, DEBORAH E. RUPP, AND SETH M. SPAIN
University of Illinois at Urbana-Champaign

Assessment centers rely on multiple, carefully constructed behavioral
simulation exercises to measure individuals on multiple performance
dimensions. Although methods for establishing parallelism among al-
ternate forms of paper-and-pencil tests have been well researched (i.e., to
equate tests on difficulty such that the scores can be compared), little re-
search has considered the why and how of parallel simulation exercises.
This paper extends established procedures for constructing parallel test
forms to dimension-based behavioral simulations. We discuss reasons
for establishing comparable, alternate simulation forms and discuss the
issues raised when applying traditional procedures to simulation exer-
cises. After proposing a set of guidelines for establishing alternate forms
among simulations, we apply these guidelines to simulations used in an
operational assessment center.

Many organizations use multiple test forms in their testing programs
(Dorans, 2004a). Long test windows, repeated testing opportunities, and
testing over time increase the possibility of item preknowledge by test
participants (Kolen, 2004a). Item preknowledge occurs when test takers
learn of actual test content prior to taking the test. Item preknowledge
introduces construct-irrelevant variance and can therefore influence test
scores. This preknowledge threatens the test validity and the inferences
made from test scores. These issues have led to the development of
formalized procedures for constructing parallel tests (e.g., O’Brien, 1989;
Rottinghaus, Betz, & Borgen, 2003; van der Linden & Adema, 1998).
Although necessary, important, and widely in use, these formal procedures
are limited in that they only directly apply to tests with few response
options and objective scoring methods. Although some extensions to these
procedures have been made, including applications for biodata, situational

Portions of this research were presented at the 20th Annual Conference of the Society
for Industrial and Organizational Psychology, Inc., Los Angeles, CA. The authors thank
Alyssa Gibbons, Amanda Farthing, Sang Woo, Carra Sims, and Myungjoon Kim for their
assistance with this project. Bradley J. Brummel is now at the University of Tulsa.
Correspondence and requests for reprints should be addressed to Deborah E. Rupp,
Department of Psychology, 603 E. Daniel St., University of Illinois at Urbana-Champaign,
Champaign, IL, 61820; derupp@uiuc.edu.

© 2009 Wiley Periodicals, Inc.


judgment tests, and in-basket exercises (Clause, Mullins, Nee, Pulakos, &
Schmitt, 1998; Lievens & Anseel, 2008; Lievens & Sackett, 2007; Oswald,
Friede, Schmitt, Kim, & Ramsey, 2005), there is currently no formal
protocol for establishing parallelism among behavioral assessments (e.g.,
those scored by assessors, such as simulation exercises in assessment
centers). This paper makes an initial attempt to apply and extend standard
test development procedures into this domain.
In the sections that follow, we more fully describe the behavioral
assessment context. Then, we review the test equating literature as it ap-
plies to traditional test formats and highlight aspects of behavioral assess-
ment for which the traditional protocol cannot be applied. Given these
gaps, we propose extensions of established procedures for these con-
texts. Finally, using data from an operational assessment center, we apply
these guidelines and evaluate the effectiveness of the process. We con-
clude with a discussion of the limitations of our application and provide
suggestions for researchers and practitioners attempting to construct and
equate dimension-based behavioral simulations in the future.

Dimension-Based Behavioral Assessment

Behavioral simulation exercises have a long history as a method for
measuring complex human skills and abilities (Thornton & Rupp, 2003).
They are used to assess overt behavior prompted by complex stimuli in
order to measure both procedural skills and declarative knowledge (Thorn-
ton & Rupp, 2006). Procedural skills are difficult to measure with more
conventional paper-and-pencil tests and therefore often require the high
fidelity formats that simulation exercises provide (Thornton & Mueller-
Hanson, 2004). The term assessment center (AC) is used in line with
the International Taskforce on Assessment Center Guidelines (2000) to
describe a collection of behavioral simulation exercises and other tests
that are administered collectively. The broader term behavioral assess-
ment is used to refer to a larger class of measures including work sample
tests and interviews. Although other forms of dimension-based behav-
ioral assessment are not the specific focus of this paper, any behavioral
assessment that is used to evaluate performance dimensions could apply
the procedures outlined in this paper.
The skills and abilities displayed in behavioral simulation exer-
cises are often evaluated by categorizing overt behaviors into focal
performance dimensions, often termed behavioral dimensions in the lit-
erature (Rupp, Snyder, Gibbons, & Thornton, 2006). Example dimen-
sions include decision making, planning and organizing, and informa-
tion seeking (Thornton & Mueller-Hanson, 2004; also see Table 1).
Multiple dimensions are often assessed in a single simulation exercise,

TABLE 1
Dimensions Assessed

Dimension
Problem solving • Problem understanding
• Thinking solutions through
• Decisiveness
Oral communication • Verbal/nonverbal expression
• Message clarity
• Appropriate communication style
Leadership • Guidance of others
• Balance of needs
• Personal effectiveness
Conflict management • Effective strategies
• Handling conflict
• Constructive solutions
Information seeking • Use of multiple sources
• Situational relevance
• Creation of usable patterns
Planning and organizing • Goal setting
• Allocation of time and resources
• Monitoring and conducting planned activities
Fairness • Interpersonal sensitivity
• Appropriate outcomes
• Executing processes
Cultural adaptability • Understand cultural differences
• Culturally sensitive judgments
• Culturally appropriate communication

and dimensions can be further decomposed into subdimensions for more refined measurement.
Simulation exercises can take on multiple formats. The basic require-
ment is that a candidate is presented with situational stimuli requiring
action on his or her part. The simulation is designed such that the situation
elicits behaviors that are indicative of the job-relevant performance dimen-
sions of interest. Common simulation formats include leaderless group
discussions, in-baskets, case study analyses with presentations, and inter-
view simulations. These different formats are designed to elicit behaviors
indicative of specific performance dimensions such as problem solving,
oral communication, conflict resolution, and the like (e.g., Thornton &
Mueller-Hanson, 2004; Thornton & Rupp, 2006; Zedeck, 1986).
Candidates taking part in behavioral simulation exercises are typi-
cally observed by trained assessors who take detailed behavioral obser-
vation notes during the exercises. Following the exercise, the behavioral
observations are categorized into the dimensions of interest. Assessors
then use behavioral checklists (Harris, Becker, & Smith, 1993; Schneider
& Schmitt, 1992) or behaviorally anchored rating scales (BARS; Smith
& Kendall, 1963) to score the candidate on each dimension. Using ei-
ther a consensus or statistical integration method, these ratings are often
integrated across exercises to form overall dimension ratings and, if ap-
propriate, an overall assessment rating (Zedeck, 1986).
Although the multiple simulation technique applied by assessment
centers has been found to be an effective tool for predicting future per-
formance (Hardison & Sackett, 2004), diagnosing training needs (Arthur,
Day, McNelly, & Edens, 2003), and providing opportunities for expe-
riential learning (Rupp, Snyder, et al., 2006), there exists a debate in
the literature as to whether the method truly allows for the assessment
of the intended dimensions or whether simulations are better suited for
assessing unitary task proficiency such as running a meeting or giving
a presentation (e.g., Lance, 2008; Lance, Lambert, Gewin, Lievens, &
Conway, 2004; Sackett & Dreher, 1982; Schleicher, Day, Mayes, &
Riggio, 2002). Although these arguments have caused decades of debate
among assessment center scholars, more recent research has criticized the
methods used by construct validity critics and has noted the lack of empirical
evidence supporting the validity of assessment centers that use alternate
constructs such as tasks (Connelly,
Ones, Ramesh, & Goff, 2008; Howard, 2008; Kolk, Born, & Van der
Flier, 2002; Thornton & Rupp, 2006). For example, research has shown
that when one incorporates the modern view of validity, when one uses
across-exercise dimension ratings rather than within-exercise dimension
ratings, when one takes a nomological network approach involving exter-
nal correlates rather than a multitrait–multimethod approach, and when one
incorporates multiple dimension (item) ratings and uses a factor analytic
approach, it becomes clear that using dimensions as the
focal constructs in assessment centers is an effective strategy (see
Rupp, Thornton, & Gibbons, 2008, for a review of this research).

The Need for Parallel Simulation Exercises

Simulation exercises are used for both selection and development, and
we argue that parallel simulations are potentially desirable in both con-
texts. Parallel versions of simulation exercises may be useful for adjusting
content to different settings or updating simulations with changes in tech-
nology or language. It has also been suggested that constructing alternate
forms of simulation exercises may be more cost effective than constructing
new simulation exercises from scratch (Lievens & Anseel, 2008). How-
ever, the most pressing reason for creating alternate forms of simulation
exercises (as well as for tests of any kind) is the possible influence of
preknowledge on validity. Concerns regarding content compromise and
preknowledge have been documented in the literature (Lievens & Anseel,
2008; Lievens & Sackett, 2007), and evidence has been presented showing
how coaching can affect simulation scores (Brannick, Michaels, & Baker,
1989; Brostoff & Meyer, 1984; Moses & Ritchie, 1976; Petty, 1974).
However, we are unaware of any specific research that directly assesses
the effect of simulation preknowledge on validity. At this point it remains
a theoretical possibility.
In some ways, preknowledge of the specific content making up simula-
tion exercises could affect an individual’s performance in a similar manner
to what has been described for other testing formats (cf. McLeod, Lewis,
& Thissen, 2003). If an individual’s scores on the dimensions assessed in
a simulation are artificially inflated due to specific knowledge of, or ex-
perience with, the content of the simulation exercise, then the scores will
not accurately reflect the level of proficiency for that individual on those
dimensions. In such a situation, the validity of inferences made from these
scores would be questionable. However, there are also some aspects of
simulation exercise preknowledge that are unique to this method. That is,
with regard to traditional paper-and-pencil-based tests, if test takers gain
access to items, they can determine the answers and memorize both the
items and correct responses prior to a second test administration (if paral-
lel versions are not used), or share this information with future test takers
(Cizek, 1999). The case is somewhat different in assessment center con-
texts. In assessment center programs, it is common to inform participants
ahead of time of the types of exercises that will be used and the dimensions
that will be assessed (indeed, this is common in traditional high-stakes
testing as well). What assessment center participants do not have access
to are the stimulus materials making up the exercises and the scoring
rubrics used to rate them on the dimensions. Stimulus material includes
the actual information that is provided to the participants at the start of
the exercises, which depends completely on the situation being simulated.
It might include memos and letters, personnel files, descriptions of the
situation that the participant must deal with, and so on. Scoring rubrics
include the BARS or behavioral checklists that define what constitutes
proficiency on the dimensions in the particular situation being simulated.
For selection/promotion. In a selection/promotion context, if the stimulus
material is compromised, participants would have the opportunity to
practice prior to taking part in the simulation exercises, therefore in-
flating their scores and reducing the criterion-related validity of the as-
sessment. Thus, as the number of assessment centers implemented in
organizations continues to increase worldwide, issues of test security
become increasingly important (Krause & Thornton, 2007; Kudisch et al.,
2001; Spychalski, Quiñones, Gaugler, & Pohley, 1997). Because they have
been established as a method producing scores with relatively substantial
predictive validity, while at the same time minimizing adverse impact
(Goldstein, Yusko, & Nicolopoulos, 2001; Huck & Bray, 1976; also see
Thornton & Rupp, 2006, for a review of this literature), assessment centers
are frequently used for selection in the most litigious of settings (e.g., po-
lice and fire departments, Coulton & Feild, 1995; Lowry, 1996). This use
has put AC users on an even higher level of alert to ensure that simulation
content and scoring procedures remain secure and uncompromised over
time. It is also important to note that if selection programs are using more
than one version of a simulation exercise, different degrees of difficulty
in any aspect of the versions introduce potentially litigable unfairness into
the selection procedure.
For development. In a developmental context (e.g., Tillema, 1998),
information about the scoring rubrics could be inferred from feedback
(where behavioral examples of effective and ineffective behavior in rela-
tion to the specific simulation exercises are often given) and thus could
allow participants to show dimension improvement when reassessed with
the same exercises, improvement that might not generalize to broader
workplace settings (thereby compromising the ability of the developmental
assessment center to catalyze dimension-level improvement across sit-
uations). One developmental assessment center design, advocated by
Thornton and Rupp (2006), involves having participants first take part
in a preliminary set of simulation exercises that set a baseline for their
proficiency on the dimensions. Following their participation in the first
set of exercises, participants receive detailed feedback and set goals for
improvement. Following this, they take part in a second set of simu-
lation exercises, after which they receive feedback on their short-term
improvement and set more long-term goals. In this context, the criterion
is improvement on dimension proficiency from Time 1 to Time 2. With-
out having established that simulations are of equal difficulty, assessment
center researchers and practitioners have no way of knowing if measured
improvement is due to learning and development or if score change is
due to the Time 2 simulations being easier than the Time 1
simulations. Likewise, if participants complete identical simulations at
reassessment, then it would not be possible to untangle whether improved
performance is the result of familiarity with the specific exercise content
and scoring procedures or improved dimension proficiency. Establishing
parallel simulations allows investigators to attribute dimension improve-
ment to increased dimension proficiency rather than to exercise mastery.
In summary, whenever there is a large-scale usage of simulation ex-
ercises for evaluation or development, it becomes potentially desirable
to create alternate forms of simulation exercises, and with this practice
comes the requirement of statistically establishing parallelism between
these alternate forms. Unfortunately, there currently exist no guidelines
for carrying out such a practice. As a first step, we look to the literature
on establishing parallelism between alternate forms of traditional (i.e.,
nonbehavioral) paper-and-pencil forms of assessment.

Parallel Tests

Parallel tests are defined as measures of the same construct that have
equal true scores, standard deviations, item intercorrelations, and simi-
lar factor structures (Cronbach, 1947). Parallel tests are rarely, if ever,
achieved in practice given this definition, and a thorough test of the par-
allelism of any set of measures will almost always support the conclusion
that the tests are not parallel (Traub, 1994). Therefore, professional test
developers use procedures to construct measures that approximate this
definition of parallelism. There are three categories that represent relax-
ations of the assumptions of strict parallelism. Tau-equivalent tests have
equal true scores but unequal error variances. Essentially tau-equivalent
tests have true scores defined by the formula, μ1 = μ2 + a, such that the
true score on test 1 is simply the true score on test 2 plus some constant.
Essentially tau-equivalent tests have equal variances. Congeneric tests are
said to be of the same kind, with true scores defined by the formula, μ1 =
aμ2 + b. This means that true scores on test 1 are a linear transformation of
true scores on test 2. Congeneric tests have unequal variances. Therefore,
parallel tests have the strictest assumptions, followed by tau-equivalent
tests, essentially tau-equivalent tests, and finally congeneric tests, which
have the most relaxed assumptions.
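In compact form, the four models can be summarized in the notation used above; this is only a restatement of the definitions in this section.

```latex
% Hierarchy of measurement models, from strictest to most relaxed
\begin{align*}
\text{Parallel:}                   \quad & \mu_1 = \mu_2, \ \text{equal variances and error variances} \\
\text{Tau-equivalent:}             \quad & \mu_1 = \mu_2, \ \text{error variances may differ} \\
\text{Essentially tau-equivalent:} \quad & \mu_1 = \mu_2 + a \\
\text{Congeneric:}                 \quad & \mu_1 = a\mu_2 + b, \ \text{variances may differ}
\end{align*}
```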
Constructing parallel measures. Multiple strategies for constructing
parallel measures have been developed, each of which makes certain
assumptions about the nature of the testing situation. Thus, the choice of
which strategy to use depends on the item types, the number of available
items, and the dimensionality of the test.
One strategy is random domain sampling. This strategy requires con-
structing a large pool of items that reflect the construct being tested. The
items are then randomly selected from this item pool for each parallel
version of the test. This method gives each item an equal chance of being
selected for each test version (Nunnally & Bernstein, 1994). If the test ver-
sions contain enough items and the items in the pool are similarly difficult,
then, except for sampling error, this process will construct approximately
parallel tests. Random domain sampling is an effective strategy unless the
test is multidimensional, in which case the method requires separate item
pools for each distinct dimension (Clause et al., 1998). Without separate
item pools, some dimensions of the construct could be underrepresented
in alternate test forms. This strategy is also limited to testing situations in
which a large number of items can be constructed and objectively scored.
If the testing format violates these limitations, then other methods for
constructing parallel measures must be used.
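As an illustration of random domain sampling with separate item pools per dimension, the following sketch draws nonoverlapping random samples of items to build two forms. The pool layout, item labels, and function name are hypothetical and serve only to make the sampling logic concrete.

```python
import random

def build_random_forms(item_pools, items_per_dimension, n_forms=2, seed=0):
    """Randomly assign nonoverlapping items from each dimension's pool to each form
    (a sketch of random domain sampling with separate per-dimension item pools)."""
    rng = random.Random(seed)
    forms = [dict() for _ in range(n_forms)]
    for dimension, pool in item_pools.items():
        if len(pool) < items_per_dimension * n_forms:
            raise ValueError(f"Item pool for '{dimension}' is too small.")
        shuffled = pool[:]            # copy so the master pool is left untouched
        rng.shuffle(shuffled)
        for f in range(n_forms):
            start = f * items_per_dimension
            forms[f][dimension] = shuffled[start:start + items_per_dimension]
    return forms

# Hypothetical pools keyed by dimension label.
pools = {
    "problem_solving": [f"PS-{i:03d}" for i in range(40)],
    "oral_communication": [f"OC-{i:03d}" for i in range(40)],
}
form_a, form_b = build_random_forms(pools, items_per_dimension=10)
```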
A second strategy, content specification, which overcomes some of
the limitations of random domain sampling, involves constructing tests
to strict content specifications. That is, detailed test specifications are
created indicating the number and types of items to be written, and the
specific content and level of difficulty to be tapped by each item or sets
of items. These content specifications ensure that all test forms contain
similar item types and cover the same domain of knowledge. Parallel tests
are constructed by changing the surface features of the test (also referred
to as incidentals or nuisance factors) while carefully maintaining the
deep structure (also referred to as radicals or controlling factors; Dennis,
Handley, Bradon, Evans, & Newstead, 2002; Irvine, Dann, & Anderson,
1990). Using this technique, a test developer can generate approximately
parallel forms by constructing test forms that are similar in terms of their
content, difficulty, and psychometric properties while differing only in
incidental details. This method works for almost any item type and does
not require a large pool of items. However, for very short tests, unstructured
tests with multidimensional items, or items in which the structure of the
item is vital to the measurement of the construct, it may be difficult to
construct parallel measures without copying the specific structure of each
item.
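A content specification of the kind described here can be recorded as a simple structured template; the fields shown below are illustrative assumptions rather than the specification format used by any particular test developer.

```python
# Illustrative content-specification record for one block of items. The
# "deep structure" entries are held constant across forms; the "incidentals"
# are the surface features that are free to vary between forms.
content_spec = {
    "content_area": "quantitative reasoning",
    "n_items": 12,
    "item_type": "multiple choice, four options",
    "target_difficulty": (0.40, 0.70),   # intended proportion-correct range
    "deep_structure": "two-step percentage problems embedded in a short business scenario",
    "incidentals": ["character names", "industry setting", "specific numeric values"],
}
```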
A third strategy, item cloning, can be used when the test items them-
selves are multidimensional. In this case, Clause et al. (1998) suggest
cloning items from the initial measure for the construction of the parallel
form(s). The parallel items should maintain the multidimensional nature
of the original items but be different enough in content to constitute their
alternative nature. Item cloning can occur at either an item isomorphic or
an incident isomorphic level (Lievens & Sackett, 2007). Item isomorphic
cloning changes only cosmetic features like the specific words used in an
item (e.g., Clause et al., 1998). Incident isomorphic cloning uses the same
critical incident but allows more changes to the features of the items (e.g.,
Lievens & Anseel, 2008; Lievens & Sackett, 2007). Both of these types
of cloned items would retain more of the surface features of the original
measure as compared to alternative items developed using content speci-
fications. This method may be necessary when the test specifications are
not sufficient to construct an alternate form due to something in the nature
of the test such as complex, interdependent item types.
Evaluating parallelism. Once an appropriate strategy for construct-
ing parallel items has been chosen, it can be implemented. Then the
success of the process can be evaluated against the Cronbach (1947)
criteria to determine if the forms have similar true scores, standard de-
viations, item intercorrelations, and factor structures. This is done by
examining the equivalence of the sample statistics beginning with the
means, then the variances, and finally the covariance structures. If the
test means are too different to assume that the tests are parallel or
tau-equivalent, it may be possible to assume that the test forms con-
form to the more relaxed standards for essentially tau-equivalent or con-
generic tests (Traub, 1994). The means and standard deviations of al-
ternate test forms are compared using t-tests and F-tests. If means and
standard deviations are not found to be significantly different, then the
alternate forms can be assumed to be parallel or tau-equivalent. If this
is the case, scores on alternate forms can be directly compared. If sig-
nificant differences are found, the tests are considered essentially tau-
equivalent or congeneric, and a process of test equating can be used to
adjust test scores, making score comparison across alternate forms more
accurate.
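As a minimal sketch of this evaluation step, the following code compares the means of two alternate forms with an independent-samples t-test and their variances with a variance-ratio F-test. The data and variable names are hypothetical; the scipy t-test is used as documented, and the two-tailed F-test is computed directly from the variance ratio.

```python
import numpy as np
from scipy import stats

def compare_forms(scores_form1, scores_form2, alpha=0.05):
    """Compare two alternate forms on their means (t-test) and variances (F-test)."""
    x = np.asarray(scores_form1, dtype=float)
    y = np.asarray(scores_form2, dtype=float)

    # Independent-samples t-test on the means.
    t_res = stats.ttest_ind(x, y)

    # Two-tailed variance-ratio F-test, larger sample variance in the numerator.
    v_x, v_y = x.var(ddof=1), y.var(ddof=1)
    if v_x >= v_y:
        f_stat, dfn, dfd = v_x / v_y, len(x) - 1, len(y) - 1
    else:
        f_stat, dfn, dfd = v_y / v_x, len(y) - 1, len(x) - 1
    f_p = min(1.0, 2 * stats.f.sf(f_stat, dfn, dfd))

    return {
        "t": t_res.statistic, "t_p": t_res.pvalue, "means_differ": t_res.pvalue < alpha,
        "F": f_stat, "F_p": f_p, "variances_differ": f_p < alpha,
    }
```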
Test equating. Although a number of methods have been proposed
to allow for score comparison across nonequivalent tests (e.g., calibra-
tion, concordance, and projection are other methods; see Angoff, 1971;
Kolen, 2004a), test equating emerges as the most reasonable method for
evaluating the parallelism of alternate test forms in selection and devel-
opment contexts in that in these applications, alternate tests are designed
to measure the same constructs and to make identical inferences in equiv-
alent populations in the same measurement conditions. Test equating is
used to link scores on alternate forms of a test in which the alternate
forms are built to the same content and statistical specifications (Angoff,
1971). The goal of test equating is to produce scores on a measure of
the same construct in the same metric that can be used interchangeably
(Dorans, 2004a). Test equating functions scale the scores on one or more
test forms so that the alternate forms of the test have equivalent means and
distributions. The method assumes that all forms of the tests are reliable
and that the equating functions are population invariant (Angoff, 1971).1
Specific information about calculating equating functions is provided
below.

1
Testing population invariance requires large samples of participants from a variety of
subpopulations for each test version. In many cases, it is not possible to effectively test for
population invariance due to the size of the samples in the testing situation (Dorans, 2004b).
However, it has been rare to find variance in equating equations across subpopulations (i.e.,
genders, races) that exceeds sampling error (Kolen, 2004b). Although our intention is not
to minimize the possibility of population dependence in equating relationships, there are
reasons to be fairly confident that the equating relationship for alternate forms of any test
of the same construct should be approximately population invariant.

Establishing Parallelism Between Behavioral Simulation Exercises

The question now becomes, how might we apply the standard tech-
niques for constructing parallel items and test equating to behavioral
assessment? Because behavioral simulation exercises measure clearly
defined dimensions, scored on consistent metrics, there are aspects of
these techniques that can be directly applied. However, there are other
characteristics that are germane to behavioral assessment (International
Taskforce, 2000; Thornton & Mueller-Hanson, 2004) that require special
consideration and possibly creative solutions in one’s attempt to estab-
lish parallelism. These complexities include the “shifting” composition
of dimensions assessed across simulation exercises within an assessment
center, the scoring of observed behaviors by assessors, the lack of specif-
ically defined test items, and differing units of analysis depending on the
purpose of the assessment.
Assessment of dimensions. As mentioned above, most behavioral
simulation exercises measure multiple dimensions. That is, it is typical
for 3–6 dimensions to be assessed in a single exercise (Thornton & Rupp,
2006). In fact, this is one of the stated advantages of simulation exer-
cises over other assessment formats (Thornton & Mueller-Hanson, 2004).
However, in assessment centers, although it is necessary to assess each
dimension in more than one exercise (International Taskforce, 2000), it is
common not to assess every dimension in every exercise. This fea-
ture is designed to lessen the cognitive load on assessors and to limit the
assessment of dimensions to those that are most appropriately assessed by
a particular exercise. This results in a collection of simulation exercises,
each of which measures a unique combination of dimensions. This design
characteristic adds complexity to the construction of parallel simulation
exercises, because in order to establish parallelism, each alternate form
needs to assess the same set of dimensions.
Therefore, the simulation exercises must be constructed such that the
content expected to elicit behaviors relevant to each assessed dimension
is specified. Alternate versions must match this dimension-level content.
Consequently, when comparing scores across alternate simulation exer-
cises, it is necessary to compare (and equate, if deemed necessary) scores
for each dimension assessed in the simulation exercise. What this also
implies is that in order to create parallel assessment centers, it is neces-
sary to have exercises that are matched in terms of format, content, and
dimensions assessed, and the within-exercise dimension scores must be
compared (and equated, if necessary) within matched sets of exercises.
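A small sketch of the matching requirement described above: each alternate exercise should assess exactly the same set of dimensions as the original exercise it mirrors. The exercise labels and dimension sets below are hypothetical.

```python
# Dimensions assessed by each original exercise and by its intended alternate form
# (hypothetical labels). The check flags any exercise whose alternate form does not
# cover the identical set of dimensions.
original_coverage = {
    "LGD_morning": {"problem_solving", "leadership", "oral_communication"},
    "IS_morning": {"conflict_management", "fairness", "oral_communication"},
}
alternate_coverage = {
    "LGD_morning": {"problem_solving", "leadership", "oral_communication"},
    "IS_morning": {"conflict_management", "fairness", "oral_communication"},
}

mismatches = [
    exercise for exercise, dims in original_coverage.items()
    if alternate_coverage.get(exercise) != dims
]
assert not mismatches, f"Dimension coverage differs for: {mismatches}"
```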
Observed behaviors. Second, within each simulation exercise, the
scores on the dimensions are not provided directly by the participants as
in most testing formats. Rather, participants respond to the stimuli with
behaviors that are then observed, recorded, classified into dimensions, and
rated by assessors (International Taskforce, 2000). Therefore, although
this format introduces assessor variance both within simulations and be-
tween the alternate forms, an advantage to this design characteristic is that
a consistent rating scale can be used across dimensions, exercises, and
participants (Rupp, Gibbons, et al., 2006). Because the participant does
not directly interact with the scoring mechanism in a simulation exercise,
these mechanisms (e.g., BARS) can remain consistent in the alternate
versions. This increases the degree to which the simulation exercises are
constructed to the same content specifications.
“Test” items. The establishment of parallel tests, as they are tradi-
tionally conceived, also requires matching item difficulty across alternate
forms. This is not as straightforward in the construction of parallel simu-
lation exercises, where it is much less clear what constitutes a test “item.”
In these contexts, it is necessary to match the difficulty of eliciting be-
haviors associated with a certain level of proficiency on each dimension
across the alternate simulation exercises. As mentioned earlier, alternate
test forms can be constructed by changing the surface features of the test
items, while keeping the deep structure the same. However, distinguish-
ing what constitutes surface and deep structure in a simulation exercise
is difficult because the stimuli are complex. Although psychometrically
the dimensions act as items, it is difficult to identify what aspects of the
simulation content map onto which dimensions, and even harder to de-
termine the relative difficulty the stimulus material creates with regard to
each dimension. In other words, ascertaining which structural elements
will actually affect the type and level of behaviors that a participant may
display is difficult in a simulation exercise.
This complexity also raises issues for measuring the reliability of a
simulation exercise. Due to the time constraints and effort involved in con-
ducting simulation exercises, test–retest reliability is not often calculated.
An alpha-type lower-bound reliability estimate also cannot be computed
if there is only one rating provided for each dimension in the simulation
exercise. Assessors are calibrated to observe and rate behaviors reliably,
but there is no easy measure of the reliability of dimension scores. Thus,
for procedures like test equating, sufficient reliability of measurement
must be assumed.
Various units of analysis. A final unique aspect inherent to behavioral
assessment is that the unit of analysis, that is, what the test administrator
considers the “scores” for the purpose of equating across alternate forms,
differs depending on the purpose and design of the assessment program.
Although we are focusing in this paper on simulation exercises that have
dimensions as their focal construct, depending on the structure of the
program, the manner in which dimension ratings are aggregated can differ.
If the goal is to construct parallel simulation exercises, then it is necessary
to use individual dimension scores resulting from the alternate simulation
exercises for comparison. If the goal is to construct parallel assessment
centers, then it is necessary not only to ensure that dimension scores
resulting from alternate simulation exercise forms are equivalent but also
that the across-exercise dimension scores are equivalent. If the assessment
center is for training, development, or the diagnosis of training needs, the
analyses can stop here in that overall dimension ratings are the focal
constructs in such applications. However, if the assessment center is for
prediction (i.e., selection or promotion), then it is also necessary to ensure
that the overall assessment ratings (OARs, the aggregation of across-
exercise dimension scores) are also equivalent.
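For concreteness, the sketch below shows one simple way the three units of analysis could be computed from within-exercise dimension ratings. The flat data layout and the use of unweighted averaging are illustrative assumptions, not the integration rules of any particular program.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical within-exercise dimension ratings: (exercise, dimension) -> rating.
within_exercise = {
    ("LGD_morning", "problem_solving"): 5.0,
    ("IS_morning", "problem_solving"): 4.0,
    ("LGD_morning", "leadership"): 3.5,
    ("CP_afternoon", "leadership"): 4.5,
}

# Across-exercise dimension ratings: aggregate each dimension over the exercises
# in which it was assessed (unweighted mean used here for illustration).
by_dimension = defaultdict(list)
for (exercise, dimension), rating in within_exercise.items():
    by_dimension[dimension].append(rating)
across_exercise = {dim: mean(vals) for dim, vals in by_dimension.items()}

# Overall assessment rating (OAR): aggregate of the across-exercise dimension ratings.
oar = mean(across_exercise.values())
```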
Summary. Although behavioral simulation exercises can be devel-
oped according to strict specifications, carried out in a standardized man-
ner, and scored by well-trained assessors who display sufficient interrater
reliability, there are complexities inherent to the method itself that make
the establishment of parallel simulation exercises unique and challeng-
ing. Simulation exercises measure multiple discrete dimensions, which
are often rated only once per exercise and are rated by assessors rather
than the test takers themselves. Considering the issues with developing
and equating tests and the complexities involved with behavioral simula-
tion exercises, we now suggest guidelines for constructing and equating
behavioral simulation exercises.

Guidelines for Creating Parallel Simulation Exercises

The construction of comparable behavioral simulation exercises can
be accomplished through a systematic development process that takes into
consideration both the nature of simulation exercises as well as the require-
ments of establishing parallelism between test forms. Here we propose
guidelines for extending the principles of alternate test form construction
to the behavioral simulation domain. These guidelines should provide di-
rection for researchers and practitioners in constructing alternate forms
of behavioral simulation exercises that fit the theoretical definition re-
quired for establishing parallelism and equating, and as a result, allow for
the comparison of individuals across alternate simulation exercise forms.
Alternate forms of simulation exercises could be constructed simultane-
ously, or alternatively, as needed throughout the life of a testing program.
These guidelines will focus on the more likely case that the alternate
forms are being constructed after the initial simulations have already been
constructed or are in use (due to the time, cost, and effort involved in
simulation construction). Whereas a full description of how to develop
simulation exercises is beyond the scope of this paper (see Thornton &

TABLE 2
Guidelines for Constructing Alternate Simulation Exercises

Step 1: Construct detailed simulation specifications
1. Obtain original simulation specifications
2. Construct more detailed specifications from the original simulation
Step 2: Construct alternate simulation form(s)
1. Use detailed specifications to construct alternate forms
2. Reference original simulation form for specific situations
3. Consult simulation experts for equivalence of opportunity to display behaviors
4. Pretest alternate form for equivalence of opportunity to display behaviors
Step 3: Equate alternate simulation forms
1. Use alternate forms of simulations in an operational setting
2. Compare population means and standard deviations of dimension scores
(within or across exercises depending on purpose)
3. Examine other possible sources of variance in simulation forms other than
difficulty
4. Determine the relationship between test forms (parallel, tau-equivalent,
essentially tau-equivalent, congeneric)
5. Estimate test equating functions for each dimension
6. Use equating functions to adjust scores of future samples taking the alternate
simulation forms

Mueller-Hanson, 2004 for a comprehensive treatment of this topic), we
will focus our description on the elements specific to the construction of
the alternate simulation exercise forms. We organize our recommendations
around three general steps, which are listed in Table 2.
Step 1: Construct detailed simulation specifications. The first step
in building any test is to construct the content specifications based on
the purpose of the assessment. For simulations, these specifications can
include dimensions, difficulty, exercise type, setting, target population,
and so on (Thornton & Mueller-Hanson, 2004). The original behavioral
simulation exercise should be built around these specifications. The spec-
ifications for the alternate form need to be extended to include specific
details pertaining to the initial simulation. This is necessary because the
broad content specifications will most likely not be detailed enough for the
construction of a comparable alternate form. Considerable detail is needed
to fully describe the multiple behavioral dimensions to be assessed. For the
construction of the alternate form, the deep structure that elicits the rated
behaviors must be differentiated from the surface features with which the
participant merely interacts. Extracting this information from the original
simulation is a difficult process. The degree to which the alternate form
is an exact clone of the original simulation involves balancing duplica-
tion of content to maintain equivalent opportunities for behavior while
disguising the simulation enough to remove the advantages of simulation
preknowledge. There is no definitive answer to where this balance lies,
and it should be determined by the nature of the specific testing situation.
If security is a major issue, then more surface changes may be made at the
possible cost of exercise equivalence. Information that can be extracted
from the original simulation that may not be included in the simulation
specifications could include the specific topics covered in the simulation,
the number of pages and type of materials the participants need to eval-
uate, the status of the participant within the simulated organization, and
so on. This type of information can only be discerned through high levels
of familiarity with the operational functioning of the original simulation.
Even after these detailed specifications are created, the construction of the
alternate form of the simulation will involve an iterative comparison with
the original simulation for these features.
Step 2: Construct alternate simulation exercise(s). Actually con-
structing the alternate simulation exercises involves a creative process of
generating new surface structure around the simulation specifications. Of-
ten, simulation exercises within a single assessment center are linked. For
example, all of the simulations could occur within the same hypothetical
organization (Thornton & Mueller-Hanson, 2004). If this is the case, and
if it is the goal of the test developer to develop a parallel assessment center,
then an alternate organization will need to be overlaid across all of the
alternate exercises, and the alternate exercises will need to be linked in
ways similar to that of the original set of exercises. When the simulation
exercises are originally constructed to stand alone, the alternate surface
content can be anything that does not substantially change the type and
difficulty of the behaviors elicited for each dimension measured by the
simulation exercise.
Once again, writing the materials for an alternative simulation exer-
cise should be a somewhat iterative process. Maintaining similar mea-
surement of the dimensions requires using the test specifications and the
surface content in the same way for each test version. After the simula-
tion developer(s) have constructed new forms that are hypothesized to be
parallel, the exercise content should be examined by experts, pilot tested,
and modified where appropriate. These alternate simulations can then be
administered, and after sufficient levels of participation, the scores on the
forms can be compared and, if necessary, equated.
Step 3: Compare sample statistics and compute equating function (if
necessary). Once the test developers have done everything possible to
construct equivalent forms of the simulation exercises (International Task-
force, 2000), the simulations can be administered to the relevant popula-
tion and the means and standard deviation across versions compared (using
t- and chi-square-difference tests). As mentioned above, the first step is
to conduct these analyses on within-exercise dimension ratings. However,
as mentioned above, depending on the nature of the assessment program,
these analyses might also need to be conducted on across-exercise di-
mension ratings, and overall assessment ratings. If means and standard
deviations are significantly different, an equating function can be com-
puted to adjust scores in future samples so that candidates taking different
forms of the simulation exercises can be precisely compared. This step in
simulation development would ideally be done prior to implementing the
alternate simulations into the assessment program (so that the equating
function can be determined ahead of time, and all scores obtained once
the program is operational adjusted using the equating function). It is im-
portant to note that both the size and representativeness of the sample will
affect the accuracy of the equating function. Consequently, the equating
sample should be as large as possible.
As mentioned above, the nature of behavioral assessment introduces
assessors as an additional source of variance that is not typically accounted
for in standard equating functions (which typically only deal with test dif-
ficulty). Assessor variance is not the same as variance due to the difficulty
of the simulation and should not be included in the equating process if
it can be estimated independently. If the testing situation does not allow
for the estimation of assessor variance, then this source of variance can
be minimized through assessor training and calibration (e.g., Schleicher
et al., 2002) and has to be assumed to be zero for the purposes of test
equating. Depending upon the design of the simulations, it may be possi-
ble to use generalizability theory to partition out the assessor effects and
assessor-by-exercise effects before equating the scores on the simulation
exercises (Shavelson & Webb, 1991).
The essentially tau-equivalent equating function takes the form

μ1 = μ2 + a,

where μ2 is the alternate dimension rating, μ1 is the original dimension
rating, and a is a constant. The dimension ratings can be equated by adding
the mean score differences between the forms to each individual’s score
on the second version.
The congeneric equating function takes the form

μ1 = (δ1/δ2) · μ2 + a,

where μ2 is the alternate dimension rating, μ1 is the original dimension
rating, δ1 and δ2 are the standard deviations of the dimension ratings, and
a is a constant. The equating function for the dimension can be found by
first multiplying the alternate dimension rating by the ratio of the standard
deviations and then adding the constant to solve the equation.
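These two adjustments translate directly into code. The sketch below places alternate-form scores on the metric of the original form using the sample means and standard deviations from the equating samples; the variable names and example ratings are hypothetical.

```python
from statistics import mean, stdev

def mean_equate(alt_scores, orig_scores):
    """Essentially tau-equivalent case: shift alternate-form scores by the mean difference."""
    a = mean(orig_scores) - mean(alt_scores)
    return [s + a for s in alt_scores]

def linear_equate(alt_scores, orig_scores):
    """Congeneric case: rescale by the ratio of standard deviations, then shift."""
    slope = stdev(orig_scores) / stdev(alt_scores)
    a = mean(orig_scores) - slope * mean(alt_scores)
    return [slope * s + a for s in alt_scores]

# Hypothetical dimension ratings from the two equating samples.
original_form = [4.6, 5.1, 3.9, 4.4, 5.0]
alternate_form = [4.1, 4.6, 3.5, 3.9, 4.3]
adjusted = linear_equate(alternate_form, original_form)
```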
By following these guidelines, researchers and practitioners can con-
struct alternate forms of behavioral simulation exercises with the same
rigor and care for construct validity as is commonly done when using
other test formats. This kind of rigor in test construction is required if
behavioral simulations are to receive the same respect afforded to more
common testing formats.

Establishing Alternate Simulations: An Empirical Test

To test the recommendations put forth in this paper, we applied the
guidelines to the construction of parallel simulation exercises in an oper-
ational assessment center. This assessment center consisted of six simu-
lation exercises. Thus, our goal was to construct six parallel simulation
exercises, resulting in six pairs of exercises that were essentially inter-
changeable within a pair. The assessment center measured eight perfor-
mance dimensions (see Table 1). This program was developmental in
nature, and thus the practical goal of this activity was to create a parallel
assessment center, which was to be used to reassess participants 1 year
following participants’ first assessment, in order to measure learning and
change on the dimensions.

Methods

Participants

Participants were 185 managers from a variety of organizations in the
midwestern United States. These organizations represented a variety of
industries from banking to manufacturing to insurance. Fifty-three percent
of the participants were women. Eighty-seven percent were Caucasians.
The majority of participants had at least a college degree (74%). Most
of the managers were middle managers (67%). Upper-level managers
and lower-level managers were equally represented in the sample with
14% in each category. The average age of the participants was 41 years
(SD = 8.8 years). Their average tenure was 8.8 years (SD = 7.7 years) in
their organization and 4.7 years (SD = 5.0) in their current position. The
participants’ average year in the workforce was 19.9 years (SD = 9.8).
Participants were divided into two samples. Sample 1 consisted of 100
managers who participated in the original set of six simulation exercises
during the first year of program operation. Sample 2 consisted of 85
managers who participated in a parallel set of six simulation exercises
during the program’s second year of operation. The use of two separate
samples (as opposed to one sample assessed at two time points) was
necessary due to the developmental nature of the program. That is, the
program was designed to foster experiential learning and performance
improvement. Participants received detailed verbal and written feedback,
coaching, and goals regarding dimension-level performance immediately
following their completion of the simulation exercises. Thus, participants’
expected improvement on the dimensions limits our ability to test for
parallelism using the same participants. Using two distinct participant
samples is appropriate in this case in that the groups are not affected by
previous participation. However, the degree to which group differences in
ability might be responsible for score differences becomes an important
issue.
In an attempt to ensure that the two samples were as equivalent as pos-
sible, we measured participants’ cognitive ability (using the Wonderlic
Personnel Test, Wonderlic & Associates, Inc., 1983) and personality (us-
ing the Big Five Inventory, John & Srivastava, 1999). Independent sample,
two-tailed t-tests indicated no significant differences on these character-
istics between our two samples: cognitive ability (t = 1.36, p = .18),
Extraversion (t = .91, p = .36), Agreeableness (t = .84, p = .40), Open-
ness (t = 1.38, p = .17), Conscientiousness (t = −1.80, p = .07), and
Emotional Stability (t = .29, p = .78). Although these tests are not con-
clusive evidence that our two samples were equivalent, they do provide
some evidence that score differences are not the direct result of large dif-
ferences in commonly assessed individual difference variables, known to
be critical to job performance (Schmidt & Hunter, 1998).

Procedures

Exercises. Participants took part in a developmental assessment cen-
ter, consisting of six simulation exercises, designed to measure eight
behavioral dimensions. Each exercise was designed to measure five di-
mensions. The exercises included two interview simulations (IS), two
leaderless group discussions (LGD), and two case study analyses with
presentation exercises (CP). One of each type of exercise was done in the
morning and afternoon. The morning and afternoon sets of exercises were
organized into separate blocks, with all the morning exercises simulating
one organizational context and all of the afternoon exercises simulating
another.
The morning LGD involved making personnel decisions regarding a
set of employees and the afternoon LGD involved making recommen-
dations regarding company policy. The LGDs were noncompetitive and
no roles were assigned. The morning IS involved counseling a troubled
employee, and the afternoon IS involved resolving a conflict between
two subordinates. Employees with whom the participants had to inter-
act in the IS exercises were role-played by trained and experienced
undergraduate research assistants. The morning CP involved making hu-
man resource policy decisions, and the afternoon CP involved making
organizational budgetary decisions. The CP exercises were presented to
the assessor, who played the role of a representative of senior management.
These basic descriptions, the dimensions assessed in each exercise, and
the type and amount of content materials given to participants constitute a
summary of the deep structure of the exercises, which was held constant
when developing the alternate forms. More specifically, the length and
style of the company policies and the various e-mails and memos from
employees were considered deep structure, whereas the names, roles, and
specific complaints or inputs in these materials were considered surface
structure and were changed in the alternate form. The dimensions and their
definitions are listed in Table 1.
Assessors. Assessors consisted primarily of professionals enrolled in
an applied master’s degree program in human resource management. All
assessors had undergraduate degrees in psychology, had taken courses in
I-O psychology, and had work experience. Additional assessors were doc-
toral students studying industrial–organizational psychology, as well as
local (certified) human resource professionals. The assessor training and
certification program consisted of (a) going through the assessment center
as a true participant would, (b) 4 days of classroom training, (c) the shad-
owing of senior assessors, and (d) being shadowed by senior assessors.
Classroom instruction consisted of (a) testing on dimension definitions
and exercise content, (b) frame-of-reference training, and (c) training on
the avoidance of observational and rating errors. The frame-of-reference
training involved assessors watching video clips of participants in order
to practice observing, recording, and classifying behaviors, and making
dimension ratings. Ratings of assessors were compared, and discrepan-
cies were discussed. This process was repeated
until consensus was achieved. The goal of the training was to make the
assessors as interchangeable as possible such that they all would rate the
same behaviors, elicited by the same person in the same exercise (or par-
allel exercise), in the same way. Frame-of-reference training was repeated
every 6 months to ensure that assessors remained well calibrated in their
frame of reference.
Dimension scoring and assessment process. Assessors provided
within-exercise dimension ratings using BARS. These BARS were
developed to facilitate equivalent ratings across assessors. For each
exercise, three BARS were used to rate each dimension assessed,
which corresponded to the three aspects of each dimension’s defi-
nition (see Table 1). The BARS utilized a 7-point scale with be-
havioral anchors at 3 to 5 rating points per scale. The behav-
ioral anchors were developed by the group of subject matter experts
(I-O psychology faculty and doctoral students) who developed and pi-
lot tested the assessment center. An original simulation exercise and its
pair (designed to be parallel) used the exact same BARS.
The developmental assessment center program was designed to func-
tion with six participants and six assessors. Assessees and assessors were
rotated such that every participant was assessed by more than one assessor,
and each assessor assessed more than one type of exercise. Participants
received detailed, developmental feedback from an assessor that was as-
signed to them following their participation. In the feedback sessions,
each dimension was discussed in turn. Examples of effective and inef-
fective behavior elicited by the participant in the exercises were provided
for each dimension. The feedback session concluded with the formation
of developmental goals and action steps aimed at improving proficiency
on each of the dimensions. Participating in the full set of exercises and
receiving feedback took an entire 8-hour day.

Constructing the Alternate Simulation Exercises

We followed the process outlined previously to construct alternate
forms of each of the original set of six exercises. A subject matter expert
team of certified administrators and assessors familiar with the exercises
used in the original assessment center participated in the construction of
the alternate exercise forms.
Simulation exercise specifications. First, we constructed detailed
specifications for each of the six simulations. The content specifications
from the original simulations included each exercise’s type, dimensions
to be assessed, target population, and timing. Through a close examina-
tion of each exercise, we were able to construct more detailed simulation
specifications, which included information on the number of role players
and specifics of their roles; the number of pages, type, and nature of the
stimuli material provided; the nature of the conflicts and problems that
the participants encountered within the simulations; and the anticipated
difficulty for displaying certain behaviors within a specific simulation.
For example, the afternoon interview simulation involved two sub-
ordinates, portrayed by role players. The exercise involved two recently
merged companies. The subordinates were each from the different com-
panies and had various conflicts driven by differences in the two organiza-
tional cultures. We used this information to create matched specifications
for the parallel version of the exercise. Both interview simulations re-
quired a final decision to be made and written down in a memo provided
to the participants. In addition, the job descriptions, tips for approaching
the task, and descriptions of the organizational cultures at the two merging
companies were of similar length. The participants had the ability to meet
with role players in whatever style they preferred in both situations: one
at a time, together, in a specific order, or multiple times. The role players
were given two-page profiles of their characters. These characters had sim-
ilar personalities and workplace disputes from their former workplaces:
One was more individualistic and hierarchy oriented whereas the other
one was more collaborative and team oriented. After the detailed specifi-
cations had been agreed upon, the construction of the alternate forms of
these simulations commenced.
Construction of parallel simulation exercises. The detailed simula-
tion specifications were then used to construct an alternate form for each
of the original simulation exercises. The subject matter experts’ experi-
ence with the original exercises provided a working knowledge of the
difficulty of the original exercises—which we found to be essential to this
process. When it was unclear how to apply the detailed specifications to
the alternate form, the original exercise was reexamined for clarification.
Once alternate forms for all six exercises were completed, pilot testing
was conducted to help evaluate how the new simulations would function
in an operational assessment center setting.
As mentioned earlier, the assessment center program was designed
such that a different organization was simulated in the morning and after-
noon. The morning exercises were cast within a luggage manufacturing
company, and the afternoon exercises were cast in two recently merged
computer chip manufacturing companies. As a result, in constructing the
parallel simulation exercises, we also had to design materials describ-
ing two additional organizations in which the morning and afternoon
exercises of our parallel assessment center would be set. The subject mat-
ter expert team decided on service-based organizations (as opposed to
the manufacturing-based organizations that comprised the original assessment
center) as a means of altering aspects of the surface structure of the origi-
nal exercises (as recommended in the guidelines). Consequently, we cast
the alternate morning exercises in an advertising firm and the afternoon
exercises in two recently merged banks.
Developing the organizational contexts for the exercises began with
each subject matter expert reviewing the organizational contexts from
the original assessment center. They then cloned one detail of the organizational background at a time into the new settings. Next,
both the detailed simulation specifications and these new organiza-
tional background materials were used to construct an alternate form
of each original simulation. In addition, the morning and afternoon ex-
ercises, respectively, were compared to each other to check for log-
ical coherence with the other exercises in that block. The full sub-
ject matter expert team met weekly to review progress and discuss the
exercises.
To pilot test the exercises, both the original and alternate simulation ex-
ercises were administered to a group of advanced undergraduate students
at a large midwestern university.2 All participants were video recorded.
Experts watched the videos to judge the extent to which the difficulty of the alternate forms was similar to that of the original simulation exercises in terms of eliciting dimension-relevant behaviors. Using
this information, additional adjustments were made to some of the ex-
ercises. Once every possible attempt was made to emulate the detailed
content specifications and difficulties of the original simulation exercises,
the alternate forms were ready to be administered to Sample 2.
Sample comparison. Sample 2 participated in the alternate forms of
the simulation exercises. Once they had completed the exercises, we were
then able to use the results from Sample 2 to evaluate the equivalence
of the alternate forms to the original exercises completed by Sample 1.
To test whether exercise pairs were parallel, we compared the means and
standard deviations of each dimension assessed in each exercise. Further,
to test whether the two assessment centers, as a whole, were parallel, we
compared the across-exercise dimension ratings for the two groups.

Results

Comparing (Raw) Sample Statistics

We first compared Sample 1 to Sample 2 means and standard deviations on each dimension assessed in each exercise. These results are
reported in Table 3. Overall, the alternate forms produced lower dimension
scores than the original versions. The average effect size for this differ-
ence across exercises and dimensions was d = .36. The CP simulation exercises showed the smallest differences, with effect sizes below .10 in the morning and no larger than .44 in the afternoon. The IS and the LGD simulation
exercises showed larger differences across dimension types. The largest
single difference was for the fairness dimension in the first LGD (d =
1.19).
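For readers who wish to reproduce this style of comparison, the sketch below (ours, not the authors' analysis code) shows how the Table 3 statistics could be computed for a single within-exercise dimension rating: an independent-samples t test and Cohen's d for the equality of means, plus a test for the equality of variances. The article does not state which variance-equality test produced the F values in Table 3, so Levene's test is used here as one plausible choice, and the simulated ratings and the assumed rating scale are purely illustrative.

# A minimal sketch (ours, not the authors' analysis code) of the comparisons
# reported in Table 3 for a single within-exercise dimension rating. Levene's
# test is used for the variance comparison as one plausible choice; the article
# does not state which variance-equality test produced the F values.
import numpy as np
from scipy import stats


def compare_dimension(ratings_1, ratings_2):
    """Compare one dimension rating across the two samples (original vs. alternate form)."""
    x1, x2 = np.asarray(ratings_1, float), np.asarray(ratings_2, float)
    n1, n2 = len(x1), len(x2)
    difference = x1.mean() - x2.mean()

    # Equality of means: independent-samples t test and Cohen's d with a pooled SD.
    t, p_t = stats.ttest_ind(x1, x2)
    pooled_sd = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
    d = difference / pooled_sd

    # Equality of variances (Levene's test; an assumption about the original analysis).
    f, p_f = stats.levene(x1, x2)
    return {"difference": difference, "t": t, "p_t": p_t, "d": d, "F": f, "p_F": p_f}


# Illustrative usage with simulated ratings; the sample sizes mirror Samples 1 and 2,
# but the ratings themselves (and the assumed rating scale) are made up.
rng = np.random.default_rng(0)
sample_1 = rng.normal(4.9, 1.2, 100).clip(1, 7)
sample_2 = rng.normal(4.2, 1.0, 85).clip(1, 7)
print(compare_dimension(sample_1, sample_2))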
As we explained above, we were also able to conduct these same
analyses on across-exercise dimension ratings, giving us an indication of
the equivalence of the assessment centers as a whole. These results are
reported in the left-hand columns of Table 4. Here we also see that the alternate form of the assessment center produced lower dimension scores than did the original version, although the size of the difference depended on the dimension assessed. We found the greatest discrepancies for the dimensions conflict management (d = .85) and leadership (d = .65).

2 It should be noted that this pretest population was not the intended audience for the simulation exercises. However, we have no reason to believe that the lower level of skill and experience in the pilot sample differentially affected their scores on the original and alternate forms.
TABLE 3
Sample 1–Sample 2 Differences on Within-Exercise Dimension Ratings

                                                 Equality of means               Equality of variances
Dimensions                Sample   N     M     SD    Difference    t      p      d        F      p
Case and Presentation 1
Information seeking 1 98 4.63 1.49 .08 .40 .69 .06 2.48 .12
2 85 4.55 1.21
Problem solving 1 99 4.42 1.33 −.11 −.62 .54 −.09 2.56 .11
2 85 4.54 1.12
Oral communication 1 99 4.64 1.44 .08 .41 .68 .06 16.56 .00
2 85 4.56 1.03
Planning and organizing 1 97 4.30 1.43 −.10 −.50 .62 −.07 2.12 .15
2 85 4.40 1.25
Fairness 1 99 4.32 1.21 .00 .01 1.00 .00 .26 .61
2 85 4.32 1.12
Interview Simulation 1
Information seeking 1 100 4.99 1.34 .68 3.49 .00 .52 .06 .81
2 85 4.31 1.27
Planning and organizing 1 100 4.71 1.30 1.09 5.95 .00 .89 1.68 .20
2 85 3.61 1.15
Leadership 1 100 5.01 1.19 .97 5.05 .00 .76 2.57 .11
2 85 4.04 1.40
Conflict management 1 100 4.99 1.21 .81 4.33 .00 .65 1.06 .31
2 85 4.18 1.29
Fairness 1 100 4.86 1.03 .16 .89 .38 .14 7.40 .01
2 85 4.70 1.36
Leaderless Group 1
Problem solving 1 100 4.90 1.19 .72 4.43 .00 .65 5.56 .02
2 85 4.18 1.00
Oral communication 1 100 4.69 1.14 .59 3.79 .00 .57 2.94 .09
2 85 4.11 .92
Leadership 1 100 4.45 1.25 .57 3.22 .00 .48 2.28 .13
2 85 3.88 1.13
Conflict management 1 100 4.82 1.11 .94 5.57 .00 .83 .07 .79
2 85 3.89 1.16
Fairness 1 100 5.33 .98 1.22 8.07 .00 1.19 1.38 .24
2 85 4.11 1.07
Case and Presentation 2
Information seeking 1 100 4.22 1.26 −.18 −1.14 .26 −.17 8.80 .00
2 85 4.40 .92
Problem solving 1 100 4.36 1.23 .31 1.89 .06 .28 5.67 .02
2 85 4.05 1.00
Oral communication 1 100 5.05 1.15 .50 2.97 .00 .44 .31 .58
2 85 4.56 1.11
Planning and organizing 1 100 4.37 1.41 .05 .28 .78 .04 9.09 .00
2 85 4.31 1.14
Cultural adaptability 1 100 4.11 1.51 .14 .67 .50 .10 6.04 .02
2 85 3.97 1.25
Interview Simulation 2
Information seeking 1 100 5.36 1.26 .79 4.40 .00 .65 .38 .54
2 85 4.57 1.15
Planning and organizing 1 100 4.56 1.34 .16 .85 .40 .13 2.18 .14
2 85 4.40 1.21
Leadership 1 100 4.75 1.23 .26 1.44 .15 .21 .75 .39
2 85 4.48 1.24
Conflict management 1 100 4.79 1.25 .48 2.42 .02 .36 1.77 .19
2 85 4.31 1.43
Cultural adaptability 1 100 4.91 1.47 .63 3.15 .00 .46 5.15 .02
2 85 4.28 1.22
Leaderless Group 2
Problem solving 1 100 4.70 1.26 .40 2.36 .02 .35 1.87 .17
2 85 4.30 1.00
Oral communication 1 100 4.92 1.16 .22 1.45 .15 .21 4.98 .03
2 85 4.70 .90
Leadership 1 100 4.58 1.30 .29 1.73 .09 .25 4.38 .04
2 85 4.28 .99
Conflict management 1 100 4.71 1.03 .55 3.66 .00 .55 .00 .98
2 85 4.15 .98
Cultural adaptability 1 100 4.47 1.39 .26 1.43 .16 .21 7.10 .01
2 85 4.21 1.06
TABLE 4
Sample 1–Sample 2 Differences on Across-Exercise Dimension Ratings

                                                 Equality of means               Equality of variances
Dimensions                Sample   N     M     SD    Difference    t      p      d        F      p
Information seeking 1 100 4.82 .88 .39 3.09 .00 .46 .19 .66
2 85 4.43 .83
Planning and organizing 1 100 4.52 1.02 .34 2.39 .02 .35 2.04 .16
2 85 4.18 .90
Problem solving 1 100 4.61 .89 .35 2.94 .00 .43 4.69 .03
2 85 4.26 .73
Oral communication 1 100 4.83 .91 .35 2.89 .00 .42 7.05 .01
2 85 4.48 .74
Conflict management 1 100 4.85 .83 .74 5.79 .00 .85 .25 .61
2 85 4.11 .90
Leadership 1 100 4.72 .86 .57 4.42 .00 .65 .01 .92
2 85 4.15 .89
Fairness 1 100 4.84 .72 .46 3.95 .00 .58 2.32 .13
2 85 4.38 .86
Cultural adaptability 1 100 4.50 1.15 .35 2.34 .02 .34 6.50 .01
2 85 4.14 .89

Computing and Applying Equating Functions

Based on these results, we can conclude that neither the exercises nor the assessment center as a whole were parallel or tau-equivalent. Therefore, it becomes necessary to compute equating functions so that the scores of future candidates who participate in the alternate assessment center can be adjusted, making ratings from alternate exercises (or across-exercise dimension ratings from alternate assessment centers) directly comparable. This particular issue was of
great importance for the program in which we collected our data. As we
explained above, this was a developmental program, where individuals
would take part in the original set of assessment center exercises, receive
extensive feedback, and work toward developmental goals over the course
of a year. The parallel version of the assessment center was designed so that participants could be reassessed 1 year later to determine their progress in improving their proficiency on the dimensions. The program
administrators felt it problematic to put participants through the same set
of exercises, in that the participants were provided with detailed feedback
on how they could have more effectively completed the exercises, and the
center was designed to assess global proficiency on the dimensions rather
than task performance surrounding a particular simulated situation. How-
ever, the use of the alternate assessment center stood to be problematic as
well without information on the relative difficulty of the two sets of sim-
ulation exercises. Indeed, if the exercises used for the reassessment were
easier than the Time 1 exercises, then improvement could be inferred when none had in fact occurred. If the exercises were more difficult, then evidence
for true learning and improvement would be masked. Consequently, the
process of comparing the sample statistics from the two samples described
above allowed the program administrators to (a) realize that their alternate
exercises were not parallel and (b) use the data they had collected so far
to compute equating functions. By doing so, the program staff could put
participants through the alternate assessment center at Time 2, adjust their
scores using the equating function, and more accurately determine their
development on the dimensions over time.
To show the impact of this method, we point the readers to
Tables 5 and 6. In Table 5 we provide the equating functions that were
computed using the Sample 1 and Sample 2 across-exercise dimension
ratings. Note that our method for computing the equating function depended on whether we determined the dimension ratings to be essentially tau-equivalent (different means) or congeneric (different means and standard deviations).
TABLE 5
Equating Relationship for the Assessment Center Dimensions

Dimensions Equating functions


Information seeking IS2’ = IS2 + .39
Planning and organizing PO2’ = PO2 + .34
Problem solving∗ PS2’ = (PS2 ∗ 1.22) − .59
Oral communication∗ OC2’ = (OC2 ∗ 1.24) − .72
Conflict management CM2’ = CM2 + .74
Leadership LS2’ = LS2 + .57
Fairness FA2’ = FA2 + .46
Cultural adaptability∗ CA2’ = (CA2 ∗ 1.29) − .86

∗ Equating relationship determined to be congeneric.
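As a reading aid (our restatement, not notation appearing in the original presentation), the functions in Table 5 follow the standard mean and linear equating forms, with Sample 1 (the original forms) providing the base scale and X denoting an alternate-form dimension score:

\begin{align*}
X'_{\text{tau-equivalent}} &= X + (\bar{X}_1 - \bar{X}_2), \\
X'_{\text{congeneric}} &= \frac{s_1}{s_2}\,(X - \bar{X}_2) + \bar{X}_1 .
\end{align*}

For example, the problem solving statistics in Table 4 (M = 4.61, SD = .89 for Sample 1; M = 4.26, SD = .73 for Sample 2) give a slope of .89/.73 ≈ 1.22 and an intercept of 4.61 − 1.22 × 4.26 ≈ −.59, matching the problem solving entry in Table 5; small discrepancies for the other congeneric dimensions reflect rounding of the published statistics.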

TABLE 6
A Comparison of 22 Reassessed Participants Using Equated and Nonequated
Scores on the Alternate Forms

                             Original          Alternate         Nonequated        Equated           Equated
                             forms             forms             difference        scores            difference
Dimensions                   M      SD         M      SD         M      d          M      SD         M      d
Information seeking          4.58   .99        4.48   .70        −.09   −.10       5.19   .70        .61    .72
Planning and organizing      4.28   1.08       4.32   .83        .05    .05        5.15   .83        .88    .92
Problem solving∗             4.48   .99        4.41   .76        −.07   −.08       4.79   .93        .31    .33
Oral communication∗          4.63   .79        4.60   .72        −.03   −.03       4.98   .90        .35    .42
Conflict management          4.68   .99        4.06   .78        −.62   −.72       4.84   .78        .16    .18
Leadership                   4.62   .74        4.17   .74        −.45   −.60       4.91   .74        .29    .40
Fairness                     4.77   .83        4.58   .74        −.19   −.24       5.04   .74        .27    .35
Cultural adaptability∗       3.93   1.35       4.14   .73        .21    .31        5.05   .96        1.12   .97

∗ Equating relationship determined to be congeneric.

We then applied the equating functions to a sample of 22
managers who had taken part in the original assessment center, as well as
the alternate form of the assessment center, 1 year later.3 That is, we used
the equating function to adjust the dimension scores from the alternate
assessment center.
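To make the adjustment concrete, the sketch below (ours, not part of the original program materials) shows how such functions could be computed from the published summary statistics and then applied to alternate-form scores. The means and standard deviations are taken from Table 4, the tau-equivalent versus congeneric designations from Table 5, and all names in the code are illustrative; because the published statistics are rounded, the resulting slopes and intercepts differ slightly from Table 5 for some dimensions.

# A sketch of computing and applying the across-exercise equating functions,
# assuming the summary statistics in Table 4 and the tau-equivalent/congeneric
# designations in Table 5; names and the example raw score are illustrative.

# (dimension, M1, SD1, M2, SD2, congeneric?)
SUMMARY_STATS = [
    ("Information seeking",     4.82, 0.88, 4.43, 0.83, False),
    ("Planning and organizing", 4.52, 1.02, 4.18, 0.90, False),
    ("Problem solving",         4.61, 0.89, 4.26, 0.73, True),
    ("Oral communication",      4.83, 0.91, 4.48, 0.74, True),
    ("Conflict management",     4.85, 0.83, 4.11, 0.90, False),
    ("Leadership",              4.72, 0.86, 4.15, 0.89, False),
    ("Fairness",                4.84, 0.72, 4.38, 0.86, False),
    ("Cultural adaptability",   4.50, 1.15, 4.14, 0.89, True),
]


def equating_parameters(m1, s1, m2, s2, congeneric):
    """Return (slope, intercept) mapping an alternate-form score onto the original-form scale.

    Essentially tau-equivalent case: slope of 1, shift by the mean difference.
    Congeneric case: linear equating using both the means and the standard deviations.
    """
    slope = s1 / s2 if congeneric else 1.0
    intercept = m1 - slope * m2
    return slope, intercept


for dim, m1, s1, m2, s2, congeneric in SUMMARY_STATS:
    slope, intercept = equating_parameters(m1, s1, m2, s2, congeneric)
    equated = slope * 4.00 + intercept  # adjust an example raw alternate-form score of 4.00
    print(f"{dim}: slope = {slope:.2f}, intercept = {intercept:+.2f}, equated(4.00) = {equated:.2f}")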

3 The demographic characteristics of this sample were virtually identical to the characteristics of Samples 1 and 2, described in the Methods section.
Comparing the nonequated differences to the equated differences reveals the importance of this method. That is, had the program adminis-
trators simply based their interpretations on the nonequated dimension
scores, they would have concluded that not only did participant profi-
ciency on the dimensions not improve, but it also decreased on all but
two dimensions! The conclusion would have been that the program was
ineffective and much time and money had been wasted. Looking at the
equated scores paints a much more positive picture. Here, we see that now
that the scores have been adjusted for differences in difficulty between the
alternate exercises, there is indeed evidence of learning and change over
the course of the year.

Discussion

In summary, this paper sought to highlight the importance of establishing parallel versions of behavioral simulation exercises. Herein, we
made the argument that the construction of parallel simulation exercises
should follow as stringent a protocol as that used by developers of tra-
ditional paper-and-pencil tests. We note, however, that there are several
unique features of behavioral simulations that limit our ability to simply
generalize the guidelines from this domain. As such, we provided a set of
guidelines that sought to apply the traditional test equating literature to this
special assessment context. We then attempted to follow our recommen-
dations in constructing parallel simulation exercises for a developmental
assessment center program.
Our results showed that even when strict test specifications are followed, it is difficult to construct simulation exercises that are equally difficult on any or all of the dimensions being measured. Our first case and presentation exercise came close to matching the difficulty of its original across all dimensions, but none of the other five alternate forms was similar in difficulty to the original simulation exercise with which it was matched.
At the assessment center level of analysis (i.e., across-exercise dimension
ratings), all dimensions required test equating for the scores to be mean-
ingfully compared. It appears that diligent construction of alternate forms
based upon test specifications is merely a necessary but not a sufficient
requirement for constructing comparable alternate forms of simulation
exercises.
As mentioned at the start of this paper, assessment center scholars
have long debated whether dimensions are the ideal focal constructs
in assessment center programs, pointing to studies taking a multitrait–
multimethod approach to analyzing within-exercise dimension ratings that
have shown a lack of discriminant validity evidence (e.g., Lance, 2008).
As we reviewed above, there is burgeoning evidence to suggest that,
in fact, when more contemporary and psychometrically sound methods
are used to explore construct validity, the evidence is quite supportive
for the use of dimensions (e.g., Rupp et al., 2008). However, given the
debate, we feel it is necessary to at least consider how our guidelines
could be applied to simulation exercises that measure unitary, task-based
constructs (as recommended by Lance).
In essence, our recommendations would still apply to unitary, task-
based constructs. Matching the difficulty of parallel versions of these simulations would still be important for all the same reasons discussed for the mul-
tidimensional, dimension-based simulations. Therefore, for simulations
designed to measure task-based constructs, care should be taken to con-
struct simulation specifications, construct parallel versions, and equate
those versions if necessary. The equating would only be necessary on
the overall score as these simulations are designed to be unidimensional
measures of performance.

Moving Beyond Assessment Center Contexts

Although this paper has focused on behavioral simulation exercises, many selection contexts use dimension-based or themed structured inter-
views to make decisions about individuals (Campion, Palmer, & Campion,
1997; Schmidt & Rader, 1999). These structured interviews have many
of the same characteristics as simulation exercises, including rater biases and high within-interview correlations of dimensions (Van Iddekinge,
Raymark, Eidson Jr., & Attenweiler, 2004). Parallel versions of these as-
sessments can also be constructed and equated according to the guidelines
described in this paper. Such assessments are susceptible to compromise
because of their high stakes and structured scoring rubrics. The growing
literature on alternate forms for situational judgment tests speaks more
directly to these situations (Clause et al., 1998, Lievens & Sackett, 2007;
Oswald et al., 2005).

Test Equating

Because true parallelism is difficult to achieve, equating functions will most likely be needed to compare scores on alternate simulation
exercises. When alternate simulation exercises are assumed to be essen-
tially tau-equivalent or congeneric, equating functions can be calculated
to provide equivalent scores on the dimensions. In behavioral simulation
exercises, additional sources of error variance, small sample sizes, issues
with population invariance, and dimension unreliability may influence the
level of confidence in the equating functions. However, we believe that
in the absence of evidence for these other reasons for dimension score
166 PERSONNEL PSYCHOLOGY

differences, the use of equated scores is still preferable to the use of raw
simulation scores.

Limitations and Suggestions

The illustrative assessment center used in this paper contained several limitations that may affect the conclusions that can be drawn about
establishing parallelism among all simulation exercises. First, in this pro-
gram, participants were only observed by one assessor in each exercise
(although they were observed by multiple assessors across exercises). This
limitation did not allow us to estimate the reliability of our ratings4 or the
variance due to assessors. In addition, our overall sample size was limited,
and our sample of repeat participants was quite small. This limited the in-
ferences we could make and the confidence we could have in our equating
functions. We were also unable to conduct correlational tests of the par-
allelism of alternate forms because our participants only completed one
set of the exercises before receiving feedback. Therefore, we had to rely
on our simulation specifications to construct comparable simulations. A
preferable design would have been to have participants take part in both
versions of the simulation exercises in a counterbalanced order followed
by correlational tests of scores across simulations.5
That said, this experience did provide us with insights that allow us to
make some recommendations for others constructing alternate forms of
behavioral assessments. The first recommendation is to create as detailed
as possible specifications of all the stimuli used in a simulation and the
behaviors that those stimuli could elicit. This will make any future con-
struction of parallel forms easier. A second recommendation is to change
as little as possible in the alternate form. That is, we believe that chang-
ing the nature of the organization we were simulating in our alternate
exercises (i.e., from a manufacturing company to a service-oriented orga-
nization) may have impacted the relative difficulty of the exercises. We
do not believe this difference was due to a misfit between organization type and the managers in our program, because the center was designed to measure dimensions relevant to all managers; however, the overall effect of organization type should be examined in future research. Once again,
the degree of change in the simulation will depend upon the cost and

4 We did, however, establish interrater agreement within the assessor training program.
Given the extensiveness of our frame-of-reference training, we do not feel we were at risk
of low interrater reliability (compared to the average AC program). However, it would have
been nice for the purpose of this paper to run this calculation due to its importance to
establishing parallelism.
5 We thank the editor for this suggested research design for a more thorough test of
parallelism.
likelihood of simulation preknowledge affecting valid dimension measurement in a specific assessment situation. At this time, we cannot offer generalizations about where the balance between exact replication and substantial change should lie.

Conclusion

In conclusion, this study sought to fill a gap in the literature on constructing parallel simulation exercises. Whereas recent research has ex-
plored issues such as procedures for developing objectively scored in-
baskets (Lievens & Anseel, 2008) and situational judgment tests (Clause
et al., 1998; Lievens & Sackett, 2007; Oswald et al., 2005), this work does
not address the unique issues involved with behavioral assessment. As a
result, we sought to integrate the research on establishing parallelism with
that of behavioral assessment and assessment center methods in order to
create guidelines for creating parallel behavioral simulations. Although
this is a developing process, we believe that we have created something
that should be useful to other researchers and practitioners whose work also requires the construction of comparable, alternate forms of behavioral simulations.

REFERENCES

Angoff WH. (1971). Scales, norms, and equivalent scores. In Thorndike RL (Ed.), Educa-
tional measurement (2nd ed., pp. 508–600). Washington, DC: American Council on
Education.
Arthur W Jr., Day EA, McNelly TL, Edens PS. (2003). A meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56, 125–154.
Brannick MT, Michaels CE, Baker DP. (1989). Construct validity of in-basket scores.
Journal of Applied Psychology, 74, 957–963.
Brostoff M, Meyer HH. (1984). The effects of coaching on in-basket performance. Journal
of Assessment Center Technology, 7, 17–21.
Campion MA, Palmer DK, Campion JE. (1997). A review of structure in the selection interview. Personnel Psychology, 50, 655–702.
Cizek GJ. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ:
Erlbaum.
Clause CS, Mullins ME, Nee MT, Pulakos E, Schmitt N. (1998). Parallel test form development: A procedure for alternative predictors and an example. Personnel Psychology, 51, 193–208.
Connelly BS, Ones DS, Ramesh A, Goff M. (2008). A pragmatic view of assessment center
exercises and dimensions. Industrial and Organizational Psychology: Perspectives
on Science and Practice, 1, 121–124.
Coulton GF, Feild HS. (1995). Using assessment centers in selecting entry-level police
officers: Extravagance or justified expense? Public Personnel Management, 24(2),
223–254.
Cronbach LJ. (1947). Test “reliability”: Its meaning and determination. Psychometrika, 12,
1–16.
Dennis I, Handley SJ, Bradon P, Evans J, Newstead SE. (2002). Towards a predictive model
of the difficulty of analytical reasoning items. In Irvine SH, Kyllonen PC (Eds.),
Item generation and test development (pp. 53–71). Mahwah, NJ: Erlbaum.
Dorans NJ. (2004a). Equating, concordance, and expectation. Applied Psychological Mea-
surement, 28, 227–246.
Dorans NJ. (2004b). Editor’s introduction: Assessing population sensitivity of equating
functions. Journal of Educational Measurement, 41, 1–2.
Goldstein HW, Yusko KP, Nicolopoulos V. (2001). Exploring Black-White subgroup differences of managerial competencies. Personnel Psychology, 54, 783–807.
Hardison CM, Sackett PR. (2004, April). Assessment center criterion-related validity: A
meta-analytic update. Poster presented at the 19th Annual Conference of the Society
for Industrial and Organizational Psychology, Chicago, IL.
Harris MM, Becker AS, Smith DE. (1993). Does the assessment center scoring method
affect the cross-situational consistency of ratings? Journal of Applied Psychology,
78, 359–378.
Howard A. (2008). Making assessment centers work the way they are supposed to. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 98–104.
Huck JR, Bray DW. (1976). Management assessment center evaluations and subsequent job performance of Black and White females. Personnel Psychology, 26, 13–30.
International Taskforce on Assessment Center Guidelines. (2000). Guidelines and ethical
considerations for assessment center operations. Public Personnel Management, 29,
315–331.
Irvine SH, Dann PL, Anderson JD. (1990). Towards a theory of algorithm-derived cognitive test structure. British Journal of Psychology, 81, 173–195.
John OP, Srivastava S. (1999). The big-five trait taxonomy: History, measurement, and
theoretical perspectives. In Pervin L, John OP (Eds.), Handbook of personality:
Theory and Research (2nd ed., pp. 102–138). New York: Guilford.
Kolen ML. (2004a). Linking assessments: Concept and history. Applied Psychological Measurement, 28, 219–226.
Kolen ML. (2004b). Population invariance in equating and linking: Concept and history.
Journal of Educational Measurement, 41, 3–14.
Kolk NJ, Born MP, Van der Flier H. (2002). Impact of common rater variance on construct
validity of assessment center dimension judgments. Human Performance, 15, 325–
338.
Krause DE, Thornton GC III. (2007, April). A comparison of assessment center practices in Western Europe and North America. Poster presented at the 22nd Annual Conference of the Society for Industrial and Organizational Psychology, New York.
Kudisch JD, Avis JM, Fallon JD, Thibodeaux HF, Roberts FE, Rollier TJ, et al. (2001, April). A survey of assessment center practices worldwide: Maximizing innovation or business as usual? Paper presented at the 16th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA.
Lance C. (2008). Why traditional explanations for assessment center validity are wrong.
Industrial and Organizational Psychology: Perspectives on Science and Practice,
1, 84–97.
Lance CE, Lambert TA, Gewin AG, Lievens F, Conway JM. (2004). Revised es-
timates of dimension and exercise variance components in assessment center
post-exercise dimension ratings. Journal of Applied Psychology, 89(2), 377–
385.
Lievens F, Anseel F. (2008). Creating alternate in-basket forms through cloning: Some
preliminary results. International Journal of Selection and Assessment, 15, 428–
433.
Lievens F, Sackett PR. (2007). Situational judgment tests in high-stakes settings: Issues
and strategies with generating alternate forms. Journal of Applied Psychology, 92,
1043–1055.
Lowry PE. (1996). A survey of the assessment center process in the public sector. Public
Personnel Management, 25(3), 307–321.
McLeod L, Lewis C, Thissen D. (2003). A Bayesian method for the detection of item pre-
knowledge in computerized adaptive testing. Applied Psychological Measurement,
27(2), 121–137.
Moses JL, Ritchie RJ. (1976). Supervisory relationship training: A behavioral evaluation of a behavioral modeling program. Personnel Psychology, 29, 337–343.
Nunnally JC, Bernstein IH. (1994). Psychometric theory (3rd ed.). New York: McGraw-
Hill.
O’Brien ML. (1989). Psychometric issues relevant to selecting items and assembling par-
allel forms of language proficiency instruments. Educational & Psychological Mea-
surement, 49, 347–353.
Oswald FL, Friede AJ, Schmitt N, Kim BK, Ramsey LJ. (2005). Extending a practical
method for developing alternate test forms using independent sets of items. Orga-
nizational Research Methods, 8, 149–164.
Petty MM. (1974). A multivariate analysis of the effects of experience and training upon performance in a leaderless group discussion. Personnel Psychology, 27, 271–282.
Rottinghaus PJ, Betz NE, Borgen FH. (2003). Validity of parallel measures of vocational
interests and confidence. Journal of Career Assessment, 11, 355–378.
Rupp DE, Gibbons AM, Baldwin AM, Snyder LA, Spain SM, Woo SE, et al. (2006). An
initial validation of developmental assessment centers as accurate assessments and
effective training interventions. Psychologist-Manager Journal, 9, 171–200.
Rupp DE, Snyder LA, Gibbons AM, Thornton GC III. (2006). What should managerial
developmental assessment centers be developing? Psychologist-Manager Journal,
9, 75–98.
Rupp DE, Thornton GC, Gibbons AM. (2008). The construct validity of the assessment center method and usefulness of dimensions as focal constructs. Industrial and Organizational Psychology: Perspectives on Science and Practice, 1, 116–120.
Sackett PR, Dreher GF. (1982). Constructs and assessment center dimensions: Some trou-
bling empirical findings. Journal of Applied Psychology, 67(4), 401–410.
Schleicher DJ, Day DV, Mayes BT, Riggio RE. (2002). A new frame for frame-of-reference
training: Enhancing the construct validity of assessment centers. Journal of Applied
Psychology, 87(4), 735–746.
Schmidt FL, Hunter JE. (1998). The validity and utility of selection methods in personnel
psychology: Practical and theoretical implications of 85 years of research findings.
Psychological Bulletin, 124, 262–274.
Schmidt FL, Rader M. (1999). Exploring the boundary conditions for interview validity: Meta-analytic validity findings for a new interview type. Personnel Psychology, 52, 445–464.
Schneider JR, Schmitt N. (1992). An exercise design approach to understanding assessment
center dimension and exercise constructs. Journal of Applied Psychology, 77, 32–41.
Shavelson RJ, Webb NM. (1991). Generalizability theory: A primer. Thousand Oaks, CA:
Sage.
Smith PC, Kendall LM. (1963). Retranslation of expectations: An approach to the con-
struction of unambiguous anchors for rating scales. Journal of Applied Psychology,
47(2), 149–155.
Spychalski AC, Quiñones MA, Gaugler BB, Pohley K. (1997). A survey of assessment center practices in organizations in the United States. Personnel Psychology, 50, 71–90.
Thornton GC III, Mueller-Hanson RA. (2004). Developing organizational simulations: A
guide for practitioners and students. Mahwah, NJ: Erlbaum.
Thornton GC III, Rupp DE. (2003). Simulations and assessment centers. In Thomas JC
(Ed.), Hersen M (Series Ed.), Comprehensive handbook of psychological assess-
ment, Vol. 4: Industrial and organizational assessment. New York: Wiley.
Thornton GC III, Rupp DE. (2006). Assessment centers in human resource management.
Mahwah, NJ: Erlbaum.
Tillema HH. (1998). Assessment of potential, from assessment centers to development
centers. International Journal of Selection and Assessment, 6(3), 185–191.
Traub RE. (1994). Reliability for the social sciences: Theory and applications (Volume 3).
Thousand Oaks, CA: Sage.
van der Linden WJ, Adema JJ. (1998). Simultaneous assembly of multiple test forms. Journal of Educational Measurement, 35, 185–198.
Van Iddekinge CH, Raymark PH, Eidson CE Jr., Attenweiler WJ. (2004). What do structured
interviews really measure? The construct validity of behavior description interviews.
Human Performance, 17, 71–93.
Wonderlic & Associates, Inc. (1983). Wonderlic Personnel Test.
Zedeck S. (1986). A process analysis of the assessment center method. Research in Orga-
nizational Behavior, 8, 259–296.
