
Medical Teacher, Vol. 22, No. 2, 2000

AMEE GUIDE

AMEE Guide No. 18: Standard setting in student assessment

MIRIAM FRIEDMAN BEN-DAVID


Centre for Medical Education, University of Dundee, UK

Correspondence: Dr Miriam Friedman Ben-David, Centre for Medical Education, University of Dundee, Tay Park House, 484 Perth Road, Dundee DD2 1LR, UK. Tel: +44 (0)1382 631976; fax: +44 (0)1382 645748; email: mfbendavid@aol.com

SUMMARY

Licensure, credentialling and academic institutions are seeking new innovative approaches to the assessment of professional competence. Central to these recent initiatives is the need to determine standards of performance which separate the competent from the non-competent candidate. Setting standards for performance assessment is a relatively new area of study. Consequently, there is no one recommended approach to setting standards. The goal of this guide is to familiarize the reader with the framework, principles, key concepts and practical considerations of standard-setting approaches and to enable the reader to make 'educated' choices in selecting the most appropriate standard-setting approach for their testing needs.

Why should we use standard-setting procedures?

In designing assessment tasks, test developers incorporate meaningful and essential performance criteria designed to provide evidence that candidates have successfully completed the task. Ideally, candidates should demonstrate mastery of competence by responding correctly to the task criteria and by achieving the maximum scoring points. However, in reality, candidates may demonstrate a variety of performance profiles that may range from non-competent through minimally competent to fully competent. A requirement of a mastery approach to performance (fully competent) for passing the test may appear unrealistic in most situations, owing to the complex nature of a medical task and measurement error. Consequently, test developers need an educational tool by which they determine the cut-off point on the scoring scale that separates the non-competent from the competent. The traditional approaches to defining such a cut-off point (i.e. responding correctly to 70% of the items) do not provide robust and valid evidence for pass/fail decisions. The problem intensifies when the test results hold serious promotion or career consequences for the candidates. These are the questions the standard-setting guide will address by providing both theoretical and practical guidelines to the reader.

The goal of this guide

Historically, standard-setting methods were most commonly employed on multiple-choice and other written examinations (Hambleton, 1995; Jaeger et al., 1996). During the last decade, with the emergence of performance assessment methods in general, and credential examinations in the professions in particular, there was a need to re-examine the fitness of existing methods of standard setting for the complex nature of performance assessment. A number of standard-setting approaches are currently available for both written and performance tests. Information from recent research studies may assist the reader to formulate an 'educated' choice of a particular method. Some assessment methods or procedures may better fit one standard-setting approach than another. Therefore, the present guide will describe a number of approaches and will focus in detail on one commonly used approach, the 'Modified Angoff', as an example and a guide for a standard-setting framework. The principles, key concepts and practical considerations outlined in this guide may serve as common considerations for most standard-setting approaches. Whilst setting standards on a written examination is essential for good testing practice, owing to rapid developments in the testing of professionals, special attention will be given in this guide to performance assessment. The practical example will outline a process of setting standards for performance assessment. However, it is important that the reader distinguishes between the main characteristics of written examinations and performance assessment. A section of this guide will highlight some questions concerning the differences between the two modes of assessment with regard to standard-setting procedures.

The medical education field will benefit greatly from new innovative approaches to setting standards for performance-assessment measures. It is hoped that the present guide will stimulate the reader to explore new ways of setting standards for minimally competent candidates as well as for fully competent candidates.

Key concepts

The goal of this guide
To familiarize the reader with the framework, the principles, key concepts and practical considerations of standard-setting approaches and to enable the reader to make 'educated' choices in selecting the most appropriate standard-setting approach for their testing needs.
Norm-referenced vs. criterion-referenced standards

In credentialling examinations where the stakes are high, standard-setting approaches may differ in their orientation of norm referenced vs. criterion referenced. Credentialling organizations may use a norm-referenced orientation, in which the standard is based on the performance of an external large representative sample (norm group) equivalent to the candidates taking the test. The criterion-referenced orientation links the standard to the content of the competence level under consideration. The norm-referenced standard will be somewhat unstable and will shift according to the performance of the norm group, as large as it may be. The criterion-referenced standard is a fixed standard that may undergo periodic re-evaluation in view of shifts or trends in candidates' performance over time (Nungester et al., 1991). The norm-referenced approach, employing a group-referenced standard, may result in reasonable standards providing the group is representative of the candidates' population, is heterogeneous and is large. Shift of the standard over time is a concern.

At the school level, relative (norm) vs. absolute (criterion) standards are also considered (Muijtjens et al., 1998). A relative standard can be set at the mean performance of the candidates, or by defining the units of standard deviation from the mean. Percentile ranks, as well as the median point, could also be set as norm-referenced standards. These standards may vary from year to year due to shifts in the ability of the group and may result in a fixed annual percentage of failing students, if the scores maintain a normal distribution across administrations. An absolute standard stays the same over multiple administrations relative to the content specifications of the test. The failure rate may vary, due to changes in the group's ability, from one administration to the other.

Nedelsky stated:

The passing score is to be based on the instructor's judgement of what constitutes an adequate achievement on the part of the student and not on the performance by the student relative to his/her class or to any other particular group of students. In that sense the standard to be used for determining the passing score is absolute. (Nedelsky, 1954, p. 3)

A standard is defined as absolute "if it can be stated in terms of the knowledge and skills a student must possess in order to pass the course" (Nedelsky, 1954, p. 4).

Weaknesses of relative/norm-referenced standards
• Standards are not content related
• A fixed number of candidates may fail each year
• Examinees' ability may influence the standard
• Standard is not known in advance
• Diagnostic feedback relative to performance is unclear

Compensatory vs. conjunctive standards

An example of an objective structured clinical examination (OSCE) will help clarify the difference between compensatory and conjunctive standards. Let us assume we are required to set standards on a 15-station standardized patients examination. Each station measures three skills: history, physical examination and communication. Each station contains separate checklist items (criteria) for each skill component. Let us further assume that each station varies significantly in difficulty in relation to history and physical examination scores. However, performance on communication skills does not vary substantially from one station to another, and a constant difficulty level is maintained for all stations.

In any given station, all candidates' positive responses per skill are summarized and divided by the highest possible score and multiplied by 100 to produce a station skill percentage score. The total test score is calculated by averaging the station scores across the 15 stations. If the standard is set on the total test score, which is the average performance across stations, the skill standard will constitute a compensatory standard. This method of scoring permits candidates to compensate for relatively poor skill performance on some stations (Norcini et al., 1993). Any combination of performance (skill scores) across the stations is acceptable, as long as the examinee exceeds the skill performance standard for the total test (Hambleton, 1995).

Test developers or decision makers may, however, insist that a number of stations should be competently managed before a passing score is warranted. This method, which explicitly defines the number of stations that must be successfully managed in order to pass the test, constitutes a conjunctive standard. Similarly, test developers may decide that candidates must exceed each skill standard (history, physical examination and communication) separately at the total test level. This approach will also constitute a conjunctive standard, which does not allow candidates to compensate for relatively poor performance on one or more skills. For example, poor performance on history should not be compensated by high performance on physical examination and vice versa. However, multiple conjunctive standards may produce multiple failures. In a long OSCE examination, even the better students may fail a station because of measurement error (Hambleton, 1995). In an OSCE a single station score is considered an unreliable score, mostly as a result of case specificity and sampling error (C.P.M. Van der Vleuten, G.R. Norman & E. De Graaf, Pitfalls in the Pursuit of Objectivity, personal communication, 1990). Acceptable test reliabilities are attained with careful choice of an optimal number of stations, which will allow generalization to other clinical cases. Standards should be set on reliable scores to avoid problems of decision inconsistencies and candidates' misclassifications (Jaeger et al., 1996). Thus, conjunctive standards set on scores with low reliability at the station or test level may result in wrong pass/fail decisions due to measurement error. Issues of resitting the examinations due to multiple failures, and the logistics and cost involved, should be considered carefully prior to selecting a conjunctive approach.

The degree of relationship among the test skill components may further assist in conjunctive vs. compensatory decisions. The higher the correlation among the test components, the greater the inclination towards a compensatory standard. Highly correlated test components may imply a common construct or dimension of performance, in which performance on one component impacts on performance on the other. The aggregation of related component scores into one dimension score is the basis for a compensatory pass/fail standard. If there is no or little relationship among test components, the conjunctive methods could be justified. The communication skill in this example may call for a compensatory standard to be set for the total test and not separately for each station, since skill scores do not vary significantly from station to station. However, the history and physical examination standards will be set separately for each station, and it is the decision of the test developers whether the total test standard for each skill will be conjunctive or compensatory. For example, in an OSCE where history and physical examination test component scores are highly correlated, both components may be regarded as one dimension of data-gathering skills. Setting standards on a combined score of data gathering constitutes a compensatory approach by which an examinee may compensate for poor physical examination skills with good history skills. In another example the two skill components may be uncorrelated but a combined score presents higher reliability compared with the two unreliable, separate component scores. In such a case psychometricians will explore issues of misclassification in the aggregated approach (compensatory) vs. the separated component approach (conjunctive) (Jaeger et al., 1996). A conjunctive standard allows diagnostic feedback to candidates, since each skill component is considered separately. In the compensatory approach information about test components is aggregated by setting one standard. Consequently, performance feedback on separate test components may not be available.

Conjunctive standard considerations
• Reliability of the test components at the station or skills level might be low
• Long tests may produce multiple failures at the station level
• Standards on multiple test components may also result in multiple failures

Compensatory standard considerations
• What are the consequences of compensating for poor performance?
• Loss of diagnostic feedback
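To make the distinction concrete, the following sketch (in Python, not part of the original guide) applies both decision rules to the history skill of a hypothetical 15-station OSCE. The station results, the 60% skill standard and the 12-station conjunctive rule are invented purely for illustration.

```python
# Illustrative sketch: compensatory vs. station-level conjunctive pass/fail
# decisions for one skill (history) on a hypothetical 15-station OSCE.
# All numbers below are invented for the example.

def station_percentage(points_earned: int, max_points: int) -> float:
    """Station skill percentage score: positive responses / maximum x 100."""
    return 100.0 * points_earned / max_points

# Hypothetical (points earned, maximum points) per station for one candidate.
history_results = [(14, 20), (20, 28), (15, 28), (22, 30), (19, 28),
                   (25, 34), (17, 31), (16, 28), (20, 28), (23, 30),
                   (16, 28), (27, 34), (20, 31), (17, 28), (21, 28)]

station_scores = [station_percentage(p, m) for p, m in history_results]
total_test_score = sum(station_scores) / len(station_scores)

SKILL_STANDARD = 60.0    # assumed cut-off on the history percentage score
STATIONS_REQUIRED = 12   # assumed conjunctive rule: stations that must be passed

# Compensatory: only the average across stations must exceed the standard,
# so strong stations offset weak ones.
passes_compensatory = total_test_score >= SKILL_STANDARD

# Conjunctive (station level): a minimum number of stations must each exceed
# the standard; no compensation across stations is allowed.
stations_passed = sum(score >= SKILL_STANDARD for score in station_scores)
passes_conjunctive = stations_passed >= STATIONS_REQUIRED

print(f"total history score = {total_test_score:.1f}")
print(f"compensatory decision: {'pass' if passes_compensatory else 'fail'}")
print(f"stations passed = {stations_passed}/15, "
      f"conjunctive decision: {'pass' if passes_conjunctive else 'fail'}")
```

Under these invented numbers the candidate passes the compensatory rule (average about 67%) but fails the station-level conjunctive rule (11 of 15 stations), which is exactly the trade-off between compensation and multiple failures discussed above.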
Standard-setting methods

Test-centred models

Jaeger (1989) refers to test-centred models in which the judges set standards by reviewing the test items and provide judgements as to the 'just adequate' level of performance on these items. The following descriptions of the test-centred method accentuate the different ways for judges to provide their ratings. The Angoff (1971) model employs a test-centred approach and is known for its wide use in educational testing and performance assessment. The present guide will use the Angoff approach as a case study for setting standards.

Ebel's (1972) approach requests judges to categorize the items in a test employing a number of categories, for example: three levels of difficulty and four levels of relevance to the decision to be made. In a modified Ebel procedure, judges categorize items as 'Essential', 'Important' or 'Indicated' (Case & Swansen, 1998). After classifying the items into each category, judges then decide on the proportion of items in each category that a hypothetical group of borderline examinees could respond to correctly. The number of items in each category is multiplied by the proportion of items in that category that would be answered correctly by a borderline candidate, and these products are summed to produce a test standard.
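As a rough numerical illustration of the Ebel calculation just described (the category sizes and judged borderline proportions below are invented, not taken from the guide):

```python
# Hypothetical Ebel calculation: items per category x judged proportion a
# borderline examinee answers correctly, summed over categories.
categories = {
    # category: (number of items, judged proportion correct for a borderline examinee)
    "Essential": (40, 0.90),
    "Important": (35, 0.65),
    "Indicated": (25, 0.40),
}

test_standard = sum(n_items * p_correct for n_items, p_correct in categories.values())
total_items = sum(n_items for n_items, _ in categories.values())
print(f"Ebel passing score: {test_standard:.1f} of {total_items} items")
# 40*0.90 + 35*0.65 + 25*0.40 = 68.75 of 100 items
```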
The Nedelsky (1954) approach was originally designed for multiple-choice items. For each item, the judges decide how many of the distractors (response options) a minimally competent examinee would recognize as being incorrect. The minimal pass level for the item is computed as one over the number of options remaining after the incorrect options are removed. For example, on a five-option item, if a judge decides that the minimally competent examinee will recognize two options as being incorrect, the minimal pass level (MPL) for this item will be 0.33, since three options remain after the two incorrect options were removed. The MPL is then summed over all items in the test to produce a passing score for the total test.
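A minimal sketch of the Nedelsky computation, assuming five-option items; the per-item judgements below are invented for illustration.

```python
# Nedelsky minimal pass levels for five-option MCQ items.
N_OPTIONS = 5
distractors_eliminated = [2, 3, 1, 4, 2, 0, 3, 2]   # hypothetical judgement per item

# MPL per item = 1 / (options remaining after the eliminated distractors are removed).
mpls = [1.0 / (N_OPTIONS - k) for k in distractors_eliminated]
passing_score = sum(mpls)    # summed over items to give the test passing score

for i, (k, mpl) in enumerate(zip(distractors_eliminated, mpls), start=1):
    print(f"item {i}: {k} distractors ruled out -> MPL = {mpl:.2f}")
print(f"Nedelsky passing score for the {len(mpls)}-item test: {passing_score:.2f}")
```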
Jaeger's method (1982, 1989) differs from most other test-centred methods in a number of important ways (Kane, 1994), all of which tend to emphasize the role of standard setting as the development of policy rather than as a technical approach to estimating a passing score. Jaeger (1989) emphasizes the importance of recognizing the need to "sample all populations that have a legitimate interest in the outcomes of competency testing". The political aspects involved in many testing contexts are emphasized, as well as the need to incorporate multiple points of view. Most of the other methods employ a single panel of expert judges. The focus in Jaeger's method is on the passing examinees rather than on the borderline or the minimally competent. Judges are asked to rate 'yes' or 'no' for each item, whether each examinee who passes the test should be able to answer the item correctly. This procedure is employed within a number of expert panel groups. Each panel will receive input from the other panels' ratings and will be presented with actual performance data. Judges are then asked to provide second ratings or more, while reconsidering their judgements in view of the new information. In such iterative procedures, considerations of the validity of the standards are built into the process (Kane, 1994).

Examinee-centred models

In the examinee-centred approaches panellists make pass/fail decisions by identifying a point on the score scale that would be most consistent with the test purpose (Kane, 1994).

In the Borderline-Group method (Livingston & Zieky, 1982), the judges may use different approaches to identify an actual (not hypothetical) borderline group. The level of achievement of this group is around the performance standard. The judges may use their experience or other methods of assessment to identify the group. The median score for this group could be used as the passing score. The scores of the borderline group should cluster together to produce a reasonable standard. If the scores are spread, the method may not be applicable.

In the Contrasting Groups approach (Livingston & Zieky, 1982), the panellists sort the examinees into two groups: competent and not competent. The judgement is based on characteristics of the examinees relative to the task other than the test scores (i.e. the test scores are not known to the panellists during the sorting process). After the sorting is completed, the score distributions for the competent and not competent groups are plotted. Commonly, the point of intersection of the two distributions could be considered as the passing score (Clauser et al., 1996; Burrows et al., 1999). However, the overlapping areas of the distributions would indicate the degree of separation between the two groups. The larger the overlapping areas, the smaller the separation between the two groups. Ideally the chosen passing score should maximize discrimination between the competent and the non-competent examinees, thus the distributions should indicate minimal overlap.
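The following sketch illustrates the contrasting-groups idea numerically; the two (normal) score distributions are assumed purely for illustration, whereas in practice the plotted distributions of the actual sorted groups would be used.

```python
# Locating the passing score where the two group distributions intersect.
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

scores = np.linspace(0, 100, 1001)
non_competent = normal_pdf(scores, 48, 10)   # assumed density for the not-competent group
competent = normal_pdf(scores, 68, 9)        # assumed density for the competent group

# Intersection between the two group means: where the densities are (nearly) equal.
between = (scores > 48) & (scores < 68)
idx = np.argmin(np.abs(competent[between] - non_competent[between]))
passing_score = scores[between][idx]

# The overlapping area indicates the degree of separation (smaller = better separation).
step = scores[1] - scores[0]
overlap = np.minimum(competent, non_competent).sum() * step

print(f"passing score at the intersection: {passing_score:.1f}")
print(f"overlapping area: {overlap:.2f}")
```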
Relative/absolute compromise standards: the Hofstee method

This is a standard-setting approach that incorporates the advantages of both relative and absolute standard-setting procedures (de Gruijter, 1985). Panellists review the test materials and are asked to provide four values:

(1) lowest acceptable percentage of failing examinees (minimum failure rate);
(2) highest acceptable percentage of failing examinees (maximum failure rate);
(3) lowest score which allows a candidate to pass (minimum passing point);
(4) highest score required for a candidate to pass (maximum passing point).

The median values of the group of judges are plotted for each value. In Figure 1, the minimum failure rate is 0, the maximum failure rate is 20%, the minimum passing point is 50%, and the maximum passing point is 60%. The (actual performance) test scores curve is plotted, which indicates the failure rate as a function of the passing score. The intersection of the test scores curve with the diagonal line drawn from upper left to lower right is the cut-off point (just above 55% correct).

Figure 1. The Hofstee compromise: the failure rate plotted as a function of the passing score, with the cut-off at the intersection of the test scores curve and the diagonal of the acceptable region.
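A hedged sketch of the Hofstee computation follows, using the four values read from Figure 1. The candidate score sample that generates the failure-rate curve is simulated here; in practice the observed score distribution would be used.

```python
# Hofstee compromise: intersect the observed failure-rate curve with the
# diagonal of the rectangle defined by the judges' four values.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(62, 8, size=500).clip(0, 100)   # hypothetical percentage scores

f_min, f_max = 0.0, 20.0      # acceptable failure rates (%)
k_min, k_max = 50.0, 60.0     # acceptable passing scores (%)

cut_points = np.linspace(k_min, k_max, 201)
# Observed failure rate (%) at each candidate cut score.
failure_curve = np.array([100.0 * np.mean(scores < c) for c in cut_points])

# Diagonal from the upper-left corner (k_min, f_max) to the lower-right corner
# (k_max, f_min); its intersection with the failure-rate curve is the cut-off.
diagonal = f_max + (f_min - f_max) * (cut_points - k_min) / (k_max - k_min)
cut_off = cut_points[np.argmin(np.abs(failure_curve - diagonal))]

print(f"Hofstee cut-off score: {cut_off:.1f}% correct")
```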
Modified Angoff

The Angoff standard-setting approach is one of the most widely used in medicine (Cizek, 1996). In its purest form the Angoff method is a judgemental approach in which a group of expert judges makes estimates about how borderline candidates would perform on items in the examination (Livingston & Zieky, 1982; Berk, 1986), i.e. the proportion of borderline examinees who will answer an item correctly. This is equivalent to estimating the candidate's likelihood of answering a number of items correctly (Sizmur, 1997). Estimates are averaged over judges and summed over items to create a standard (cut-off score). A modification adds to the process the provision of data about the difficulty of the items, based on actual performance data. The focus on the borderline group should enhance maximum discrimination at the borderline level.

The Angoff method depends on the expert judges' (panellists') degree of familiarity with the minimally competent test-taker (the hypothetical borderline group), who is neither qualified nor unqualified to pass the test (Norcini et al., 1993). The panellists are asked to make judgements about that candidate's likelihood to respond correctly to each of the test items. This task is highly dependent on the capacity of the panellists to understand the meaning of a borderline candidate's characteristics. The panellist also has to understand how the borderline candidate will respond to the required performance task. The panellist must also be familiar with the expected level of examinee performance as defined by the purpose of the test, and be able to understand the degree of difficulty of the task. Research with judges showed that the standards that they demanded in theory were extremely stringent and different from those they actually used in practice. In general, judges have a tendency to produce very high standards (Wolf & Silver, 1986).

This guide for standard-setting approaches will focus on one modified version of the Angoff approach. In reviewing the scoring sheet of an OSCE station, panellists will be asked to determine the number of scoring points an individual borderline candidate will answer correctly in order to pass the station. This process will be done separately for each of the station's skill components: history and physical examination. A separate process will be outlined for communication skills.

Modified Angoff
• Test-centred approach
• Criterion referenced
• Relies on expert (panellist) judgements
• Requires understanding of borderline characteristics
• Actual performance data are provided to panellists as an additional information source
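A minimal sketch of the arithmetic described above: each judge's per-item estimates for the borderline candidate are averaged over judges and then summed over items. The 3 x 5 matrix of estimates below is invented for illustration.

```python
# Angoff cut-off score: average borderline estimates over judges, sum over items.
judge_estimates = [
    [0.7, 0.5, 0.8, 0.4, 0.6],   # judge 1 (hypothetical)
    [0.6, 0.6, 0.9, 0.3, 0.5],   # judge 2
    [0.8, 0.4, 0.8, 0.5, 0.6],   # judge 3
]

n_judges = len(judge_estimates)
n_items = len(judge_estimates[0])

item_means = [sum(judge[i] for judge in judge_estimates) / n_judges for i in range(n_items)]
cut_score = sum(item_means)

print(f"per-item borderline estimates: {[round(m, 2) for m in item_means]}")
print(f"Angoff cut-off score: {cut_score:.2f} of {n_items} points")
```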
Selection of panellists

The importance of panellist selection for the creation of a defensible standard-setting process cannot be overemphasized. Hambleton (1995) describes a defensible process which is based on: selection of appropriate panellists, excellent panellist training, and a well-planned and systematic standard-setting process which provides ample opportunity for discussion and deliberations among panellists. The reliability of the performance standard will be reflected by the degree of agreement among panellists, and a reasonable performance standard will coincide with expectations about the specified professional performance. Messick (1989, p. 15) refers to test validity as the "inferences and actions based on test scores". The standard-setting procedures are in fact enhancing the validity of the test by establishing, through a defensible process, the meaning and inferences that we make on the basis of test scores (Wiliam, 1996). Thus, standard-setting procedures lie at the very heart of the validity of the test. A defensible approach to standard-setting procedures is essential to support the appropriateness of standard inferences and the consequential decisions based on those inferences.

Jaeger (1991) identifies the qualifications of panellists. Among other things, they should be experts excelling in their specialization, able to conceptualize and problem-solve in their specialty and able to employ self-monitoring skills. Jaeger's criteria for panellist selection may further be developed to include qualifications relevant to test specifications, the purpose of the test and the standard-setting methods employed. For example, in setting standards for a 15-station OSCE examination for fourth-year medical students who are required to pass the examination for graduation qualifications, additional panellist characteristics should be considered. Panellists must be familiar with the performance of fourth-year students, should be familiar with the OSCE method of assessment, should have experience as clinical teachers and should have knowledge of the next phase of training the graduates are expected to enter.

The developers of tests must create their own list of panellists' characteristics, which will specifically address the purpose of the examination. In a high-stakes examination, panellists who are recognized for their contribution in the field may be appropriate. In addition, an adequate mix of gender, age, seniority, academic vs. community experience and other relevant attributes will further establish the defensibility of the standard-setting process. Multiple panellist groups rating the same stations will further add to the defensibility of the process by demonstrating consistency of standards across occasions and across panellists. The larger the panellist sample, the more stable will be the resulting standard (Jaeger et al., 1996). However, one needs to consider how manageable a large group would be. Jaeger et al. (1996) propose an estimation method for panellist group size by considering the standard error of measurement of the assessment tool and the standard deviation of the panellist ratings.

Panellists should be:
• Experts in the related field of examination
• Familiar with the examination methods
• Good problem solvers
• Familiar with the level of candidates
• Interested in education (teachers)

Written vs. performance standards

Most of the standard-setting procedures employed for performance assessment were first applied to written tests. Since multiple-choice examinations are the most commonly used form of written test, it may be helpful to consider differences between a multiple-choice examination (MCQ) as a form of written test and an OSCE as a form of performance test. Let us assume that the MCQ examination contains 400 items, sampled across domains. The OSCE test contains 15 standardized patient-based stations sampled across speciality areas. Items on both forms of assessment are either a multiple-choice question or an OSCE checklist criterion. The following statements highlight the differences between the two forms of assessment, and questions are raised for further consideration with regard to the modified Angoff approach.

(1) The traditional MCQ items are written independently of each other. In contrast, a station item is dependent on the other items in the checklist. Thus, panellist ratings for MCQs are provided separately for each item, whereas in the OSCE example items are either rated independently or as 'sets'.

Question: In which way does the rating of individual items vs. sets of items impact on the estimated standard?

(2) While estimating the difficulty of an MCQ item, panellists consider the likelihood of candidates getting an isolated knowledge component right. In the OSCE example the estimation is applied to a 'set' of items which represents a 'dimension' of professional competence.

Question: Are the two estimation procedures similar? Which estimation task is more difficult? Does the multidimensional aspect of an OSCE station violate the unidimensional assumptions that underlie standard-setting procedures?

(3) The large number of MCQ items (400) challenges the generalizability of the number of OSCE stations and the broad nature of the tests. The complexity of performance tasks, in terms of resources and time, limits the test to a manageable size.

Question: Does the standard constitute a generalizable pass/fail decision?

(4) In multiple-choice questions the common response is one correct answer. In an OSCE station the response may represent different degrees of success on a scale continuum.

Question: Does the graded response key introduce further difficulties in setting standards on performance tasks?

The above are only a few differences between MCQs and OSCE items. Since traditional standard-setting methods are applied to performance assessment, the differences should be considered for the establishment of a valid standard.

Educational benefits of standard setting

Faculty development

Standard-setting procedures could be employed as a form of faculty development. The use of orientation materials such as videos, test forms, performance data etc. allows faculty to experience first-hand information on candidates' performance on the task. In some instances, faculty may observe for the first time videoed performance of students, containing criteria for acceptable performance. They may consider the information as feedback for their own clinical teaching. They may compare poor and excellent candidates' performance to their expectations. Some may not 'believe' that the poor performer is actually enrolled in their medical school. The standard-setting process provides a reality check, which is not available otherwise owing to the low incidence of direct clinical observation in medical schools. They may identify the existence of persisting weaknesses, for example bedside manner. It is not uncommon to hear faculty say, "I am going back to the department to change the way I teach bedside manner".

Quality control of test materials

The process of exposing faculty to test materials, scoring policy and profiles of scored performance constitutes a scrutinized quality-control procedure. Panellists in the process of reviewing test materials identify inappropriate items, which are either ambiguous or irrelevant. They may find tasks for which the criteria are incomplete. As an external group of experts, they will compare video performance with the actual scoring results, and will comment on discrepancies. Satisfying results of standard-setting procedures suggest that test developers should submit only test materials in whose quality and scoring mechanism they have 100% confidence. Otherwise, they may lose panellists' trust in the test. In actual standard-setting procedures panellists are asked to refrain from comments on the design of the station or the scoring policy. However, mistakes and inconsistencies, if present, create serious complications in the process of setting standards.

Practical steps of the Modified Angoff approach

Step 1: General orientation

The facilitator presents to panellists a general overview of the OSCE test, the duration of each station, test components, scoring methods and any other information from which the panellists may benefit. Central to the orientation session is a statement regarding the purpose of the test. In the OSCE example, the purpose may read as follows:

The graduates of this medical school should demonstrate minimal competence in history, physical examination and doctor/patient communication, for the purpose of entering a supervised Pre-Registration House Officer (PRHO) training programme in the UK!

The key words are 'minimal competence' and 'supervised programme in the UK'. This statement of purpose will guide panellists in their estimates as to what is expected in the UK concerning PRHO entry-level skills, as well as what is expected of the medical school graduates at their exit point. The minimally competent concept and the continuing supervised training may, for example, direct panellists to consider conjunctive vs. compensatory standards. Knowing that candidates are expected to continue with their training in a supervised mode, and that poor performance in some areas will be further developed in the next phase of training, allows a compensatory approach. Alternatively they may consider that basic minimal competence in each component is essential for better future practice and thus a conjunctive approach is warranted.

In some instances, where faculty are not familiar with the OSCE stations, a mini OSCE is set as part of the standard-setting orientation procedure. Faculty may play the role of examinees, or of examiners by observing each other. This is done to avoid overestimation or underestimation of the station difficulty. Their own level of performance as experts may serve as a 'ceiling' effect for the standard-setting ratings. However, panellists should be cautioned that as experts they might perform short cuts, which will result in a lower score for the station. The scores they achieved in the mini OSCE could serve as good points of discussion regarding the panellists' level of performance compared with the expected level of the candidates' performance.

Step 2: Orientation to a 'practice' station

An example of the practice station is shown below, and the corresponding checklists are shown in Figures 2 and 3. The practice station is divided into two stations, history and physical. In each station an examiner observes the candidate and rates his/her performance on the checklist items.[1]

Instructions for Candidates

History taking
Mr/Mrs J is a patient who complains of easy bruising. You have four and one half minutes to take a relevant history from her. You will be asked to present a summary of your findings at the next station, including a working diagnosis.

Instructions for Candidates

Physical examination
At this station there is an observer and a patient. The patient has arterial peripheral vascular disease. Carry out an examination of the circulation of the lower extremities of the patient. While doing this, please explain to the observer what you are doing.

Test developers present the 'practice' stations to panellists. The practice station materials are discussed. At the end of the presentation, panellists are asked to make judgements as to how many items should be answered correctly by the borderline candidate in order to pass the 'practice' stations.

Examiners' checklist

Please ask the student for an identity label to place on the checklist.

Students have been given the following instructions:
Mr/Mrs J is a patient who complains of easy bruising. You have four and a half minutes to take a relevant history. You will be asked to present a summary of your findings at the next station, including a working diagnosis.

Please ask the student to summarize the history he/she has taken at the last station.
Note: prompting halves the marks.
Use the marking schedule given below and tick the item in the appropriate box.

                                                          Not done   Prompted   Not prompted
Name and age of patient                                       0          ½           1
Duration of bruising                                          0          1           2
Site of bruising                                              0          ½           1
Precipitating factors (trauma/spontaneous)                    0          1           2
Other bleeding sites (guts, urine, joints, menorrhagia)       0          1           2
Severity of bleeding                                          0          ½           1
Previous operations                                           0          ½           1
Previous dental extraction                                    0          ½           1
Family history                                                0          1           2
Drugs: aspirin                                                0          ½           1
Anticoagulants                                                0          ½           1
Other                                                         0          ½           1
Sequence                                                      0          1           2
Working diagnosis                                             0          1           2

Figure 2. Practice station: History checklist.

Examiner's checklist

Please ask the student for an identity label to place on the checklist.

Using the marking schedule given below, allocate a mark by a tick in the appropriate column to the right of the sheet as carried out by the candidate.
0 = Not done / ½ = Done poorly / 1 = Done satisfactorily

Explanation to patient                                           0   ½   1
Position patient on couch                                        0   ½   1
Inspection for evidence of skin changes: pallor                  0   ½   1
Inspection for evidence of skin changes: hair loss, dryness etc. 0   ½   1
Palpation for temperature changes                                0   ½   1
Palpation of pulses
  Femoral                                                        0   ½   1
  Popliteal                                                      0   ½   1
  Posterior tibial                                               0   ½   1
  Dorsalis pedis                                                 0   ½   1
Capillary refill (time)                                          0   ½   1
Max 10 marks

Figure 3. Practice station: Physical examination checklist.

The 'practice' orientation materials may include:

(1) a full description of the stations;
(2) history and physical examination checklists (Figures 2 and 3);
(3) videotapes of one low performer and one high performer for the practice stations;
(4) a blank checklist for the two component skills, which the panellist completes while viewing the video;
(5) the actual skill score, presented to the panellists following the completion of each video performance.

The low vs. the high performers assist the panellists to form a range of possible performance profiles for the practice stations. The orientation may contain other components according to the nature of the test. A careful process of orientation may assist the panellists in arriving at 'realistic' judgements, and further facilitates a defensible standard-setting process.
Step 3: Characteristics of borderline candidates

The importance of understanding the characteristics of borderline candidates has already been emphasized. In the modified Angoff approach the borderline candidates constitute a hypothetical group. Therefore, it is essential to introduce a process by which the panellists will increase their understanding of the borderline characteristics.

This process becomes more meaningful following the orientation steps, in which panellists viewed performance profiles, as they understand the purpose of the test and are familiar with the test materials. During this step, a facilitator poses the question to the panellists:

Please indicate the characteristics of borderline candidates that are relevant to the skills measured in this test. These are fourth-year medical students who are neither unqualified nor qualified to pass the test. These candidates' scores lie around the estimated cut-off point. It is important that we characterize the group in order to estimate and make judgements about their ability to demonstrate minimal competence in the practice station.

Panellists may then write independently the characteristics of the borderline group per skill component. The panellists' statements are then posted and the facilitator discusses each statement with the panellists. Arguments and disagreements are clarified and the group reaches a consensus as to what would be an appropriate list of borderline characteristics per skill component. This process is essential to facilitate further the estimates and judgements of the panellists and to support the defensibility of the process.

Step 4: Panellists provide ratings

A rating form is distributed to panellists (Figure 4). The rating form will be used for the entire standard-setting procedure. Two separate forms will be distributed, namely (1) history and (2) physical examination. Each form will specify in the left-hand column the alphabetical order of the stations to be rated. The next column will outline the maximum number of points available for the station. The next two columns are provided for the first and second ratings.

In the first ratings, panellists are asked to enter their individual judgement as to how many items will be answered correctly by a borderline examinee in order to pass the history skill component in this station. (Note: it is recommended that the ratings be expressed as numbers of items and not as percentages of items. Experience has shown that panellists may be influenced by traditional concepts of acceptable standards expressed in percentages [for example, 70% vs. 50%]. Thus, it is much 'safer' to count the items and provide raw numbers as estimates.) Following the completion of the first rating, the facilitator presents all panellists' ratings on the board by assigning a number to each panellist. The panellists discuss their ratings. The facilitator encourages the panellists with the highest and lowest ratings to reflect on their judgements. The facilitator will average the ratings of the panellists to produce a cut-off raw score for history in the practice station.

Panellist: ________                History

Station         Max. pts    1st ratings    2nd ratings
A (Practice)       20        ________       ________
B                  28        ________       ________
C                  28        ________       ________
D                  30        ________       ________
E                  28        ________       ________
F                  34        ________       ________
G                  31        ________       ________
H                  28        ________       ________
I                  28        ________       ________
J                  30        ________       ________
K                  28        ________       ________
L                  34        ________       ________
M                  31        ________       ________
N                  28        ________       ________
O                  28        ________       ________
P                  30        ________       ________

Figure 4. Panellist rating form.

Provision of actual performance data

A history score distribution for the practice station is presented to panellists (Figure 5). The score distribution is generated from previous administrations of the station to a similar cohort of fourth-year medical students. The characteristics of the cohort, such as gender, year, purpose of administration etc., should be presented to the panellists. Knowledge of the similarities or differences between the cohort and the potential candidates allows panellists to evaluate the appropriateness of the actual performance data. The distribution outlines the cumulative numbers and percentages of students who got one item correct, two items correct, three items correct etc.

History scores

Score    No.    Cum. No.      %    Cum. %
  0        2        2        1.3      1.3
  1        0        2        0.0      1.3
  2        0        2        0.0      1.3
  3        0        2        0.0      1.3
  4        0        2        0.0      1.3
  5        1        3        0.6      1.9
  6        0        3        0.0      1.9
  7        1        4        0.6      2.6
  8        1        5        0.6      3.2
  9        2        7        1.3      4.5
 10        7       14        4.5      9.1
 11        8       22        5.2     14.3
 12       16       38       10.4     24.7
 13       17       55       11.0     35.7
 14       25       80       16.2     51.9
 15       26      106       16.9     68.8
 16       13      119        8.4     77.3
 17       22      141       14.3     91.6
 18       10      151        6.5     98.1
 19        3      154        1.9    100.0
 20        0      154        0.0    100.0

Notes: Total number = 154. Scores are rounded to the next whole number.

Figure 5. Practice station: history score distribution.

The facilitator indicates the percentages of students who might fail the history skill component of the station if the panellist average ratings are applied to the distribution as a cut-off score. A discussion should revolve around the issue of the 'consequential data', i.e. percentage failure. Are panellists surprised? Should they expect a lower or a higher number of failures? Does the percentage failure seem to coincide with their experience? The purpose of the actual performance data is to orient the panellists to 'realistic' normative information. This provides another source of information, which helps panellists adjust their ratings when they are asked to attempt a second rating on the form (see Figure 4).
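The 'consequential' check described above can be illustrated with the Figure 5 frequencies. The sketch below assumes a panellists' averaged cut-off of 12 raw points; that value is hypothetical, but the frequencies are taken from Figure 5.

```python
# Apply an assumed cut-off to the Figure 5 distribution to see what percentage
# of the previous cohort would have failed the history component.

# Frequencies per raw history score 0..20, from Figure 5 (n = 154).
freq = [2, 0, 0, 0, 0, 1, 0, 1, 1, 2, 7, 8, 16, 17, 25, 26, 13, 22, 10, 3, 0]
n_candidates = sum(freq)

cut_score = 12   # assumed average of the panellists' first ratings (hypothetical)

# Candidates scoring below the cut-off would fail.
failures = sum(freq[:cut_score])
print(f"{failures} of {n_candidates} candidates "
      f"({100 * failures / n_candidates:.1f}%) would fail with a cut-off of {cut_score}")
# scores 0-11 -> 22 candidates, i.e. about 14.3%, matching the cumulative % in Figure 5
```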
As stated earlier, the higher the agreement among panellists, the more reliable is the standard. Therefore, it is hoped that in the second ratings the panellists will adjust their scores in view of their peers' ratings and the actual performance data. It is also hoped that the standard deviation of the ratings will decrease at the second rating. Panellists provide their second ratings, which are posted on the board by the facilitator. A final cut-off score is calculated by averaging all the ratings. The same process is repeated for the physical examination checklist, with a separate rating form for physical examination.

The same process is repeated for each station and its skill components in the test. This includes high and low video performances, orientation to the checklist and to the scoring. To maximize the outcome of the standard-setting procedures it is possible to divide, after the orientation, a large group of panellists (i.e. 18) into three groups of six each. The groups will set standards on different stations, but one or two stations will be rated by all. This is done to examine the consistency of ratings among panellist groups.

Ratings of communication skills

As stated earlier, the communication skill score did not vary substantially from one station to the other. Therefore, setting a communication standard by employing a holistic approach may be appropriate for this particular situation. Panellists do not consider a standard for each station, but rather set a point of 'how much is enough' on the scoring scale for the total test.

Orientation. Panellists are presented with test materials relevant to the communication component of the test. This includes the theoretical framework, the criteria, the training materials for the assessors (in this case the standardized patients) and the scoring scale. Panellists will view videotapes of a range of performances (high, middle and low) and the application of the scoring scheme to the performance will be demonstrated. The borderline group characteristics are discussed and listed in relation to the communication skills. Panellists will then be asked to state what will be the point on the scale at which the borderline candidate will demonstrate minimal competence for passing a station, with no reference to a particular station. Inherent in this standard-setting approach is the compensatory orientation. The stability or consistency of the communication skills component over stations (for example, Cronbach alpha reliability 0.85-0.90) justifies such an approach. In this case panellists are not concerned with compensating poor performance in one station with high performance on another, since performance is generally stable over stations.

The first ratings are presented and averaged across panellists. Actual performance data are presented and discussion revolves around the rate of failure relative to the established standard. Panellists are then asked to provide a second rating. The second rating will constitute the final standard for the communication skills component. Candidates' communication scores for each station will be averaged to produce a total test communication score. Candidates who exceed the standard will pass the communication component of the test.
Evaluation

The evaluation component of the standard-setting procedures is important from a number of perspectives:

(1) To support the validity of panellists' inferences and actions by demonstrating a defensible process.
(2) To demonstrate the reliability of the ratings by indicating a decrease in the variability of ratings. This means that more consensus was gained through discussions of ratings and feedback from the other panellists. During the process of discussion, the panellists identify their own stringent or lenient tendencies and may correct their ratings in the iterative process of multiple ratings.
(3) To find out the effect of the normative information on the panellists' ratings and ensure that the criterion-referenced approach did not shift to a norm-referenced approach owing to high influence of the normative data.
(4) To find out the effectiveness of the orientation material and process, by determining the extent to which the panellists changed their previous conceptions as a result of their orientation experience.
(5) To determine the extent to which the panellists were confident in the process and the resulting standards. Panellists' confidence in the standard is another indication of the validity of the process. A broad representative sample of panellists constitutes collective expert judgement for justifying a standard. This is another aspect that supports the defensibility of the standard-setting approach.
(6) To obtain feedback from panellists as to how the process might be improved. This is an important aspect of evaluation, since standard-setting approaches in performance assessment in the medical profession are still at an evolutionary stage. Any feedback on the improvement of the process may assist in the construction of improved approaches to setting standards. As Linn et al. (1982) state: "Standard setting methods depend on human judgements and there is no single perfect method of setting them". Therefore any improvement in the process will maximize the benefits.

Evaluation materials should include data on the first and second ratings of the panellists for each of the test components rated, which should demonstrate increased consensus of raters. It should also include a questionnaire administered to panellists at the end of the standard-setting process. Test developers should design the questionnaire according to the test needs, with consideration of the six points stated above. In high-stakes examinations the importance of evaluation cannot be overemphasized. However, even in other settings, such as medical school examinations or courses, the defensibility of the process must be demonstrated through the evaluation process.
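Evaluation point (2) can be illustrated with a small sketch comparing the spread of first and second ratings for one station. The two sets of ratings below are invented; a fall in the standard deviation from the first to the second round is the kind of evidence of growing consensus that an evaluation report would look for.

```python
# Compare the spread of panellists' cut-off estimates across rating rounds.
from statistics import mean, stdev

first_ratings = [11, 14, 9, 13, 15, 10, 12, 16]    # hypothetical first-round estimates
second_ratings = [12, 13, 11, 13, 14, 12, 12, 13]  # after discussion and performance data

for label, ratings in [("first", first_ratings), ("second", second_ratings)]:
    print(f"{label} round: mean = {mean(ratings):.1f}, SD = {stdev(ratings):.2f}")
```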
Concluding comments

• Much work is still needed to establish effective standard-setting procedures. Often, these procedures are challenged for their arbitrariness and for the unstable standards over occasions, over panellists and over methods (Jaeger et al., 1996). However, the growing need for establishing standards of candidates' performance generates innovative and creative procedures for standard setting, mainly for performance assessment.
• The length of the procedures (for the OSCE it may take two days of panellists' work) should also be considered, and ways to shorten the process are needed.
• The internal process of standard setting may incorporate a built-in mechanism to identify test components that 'behave' differently or may result in 'odd' standards. This could serve as an opportunity to identify test materials with 'flaws'. Comparison with the other test components may assist in rectifying the problem (Clauser et al., 1996).
• Further consideration must be given to fully compensatory models (Clauser et al., 1996; Burrows et al., 1999) in which test items or components are averaged to produce a test standard. Is the averaging an adequate procedure? Is the total test standard a simple average of the test component standards, or should it be a weighted average, or any other form of aggregation?
• Obtained standards should be checked against other information available on the test takers to ensure the validity and the reasonableness of the standard.
• Effective methods of training panellists to recognize borderline characteristics are essential if the Angoff approach is widely used.
• The more standard-setting procedures are applied to a variety of tests, the more we will enhance the practice of high-quality testing, and the higher will be the confidence in the testing of professional competences.
• As Jaeger et al. (1996, p. 80) state: "the state of the art [of standard setting for performance assessment] is far from a state of grace. Much work remains to be done."

Notes on contributor

MIRIAM FRIEDMAN BEN-DAVID, formerly Co-Director of the Clinical Skills Certification Program at the Educational Commission for Foreign Medical Graduates, is currently a visiting professor at the Centre for Medical Education at the University of Dundee. Her main area of research is the evaluation of physician competences in undergraduate and postgraduate education.

Notes

[1] The practice case was drawn from the University of Dundee Medical School Clinical Skills Centre OSCE bank.
References

ANGOFF, W.H. (1971) Scales, norms, and equivalent scores, in: R.L. THORNDIKE (Ed.) Educational Measurement, 2nd edn, pp. 508-600 (Washington, DC, American Council on Education).
BERK, R.A. (1986) Consumers' guide to setting performance standards on criterion-referenced tests, Review of Educational Research, 56, pp. 137-172.
BURROWS, P.J., BINGHAM, L. & BRAILOVSKY, C.A. (1999) A modified contrasting groups method used for setting the passmark in a small-scale standardised patient examination, Advances in Health Sciences Education, 4, pp. 145-154.
CASE, S.M. & SWANSEN, D.A. (1998) Constructing Written Test Questions in the Basic Sciences (Philadelphia, PA, National Board of Medical Examiners, 3750 Market Street, PA 19104).
CIZEK, G.J. (1996) Standard setting guidelines, Educational Measurement: Issues and Practice, 15, pp. 12-21.
CLAUSER, B.E., CLYMAN, S.G., MARGOLIS, M.J. & ROSS, C.P. (1996) Are fully compensatory models appropriate for setting standards on performance assessment of clinical skills?, Academic Medicine, 71, pp. 590-592.
DE GRUIJTER, D. (1985) Compromise models for establishing examination standards, Journal of Educational Measurement, 22, pp. 263-269.
EBEL, R.L. (1972) Essentials of Educational Measurement (Englewood Cliffs, NJ, Prentice-Hall).
HAMBLETON, R.K. (1995) Setting standards on performance assessments: promising new methods and technical issues, paper presented at the meeting of the American Psychological Association, New York, August 1995.
JAEGER, R.M. (1982) An iterative structured judgement process for establishing standards on competency tests: theory and application, Educational Evaluation and Policy Analysis, 4, pp. 461-476.
JAEGER, R.M. (1989) Certification of student competence, in: R.L. LINN (Ed.) Educational Measurement, 3rd edn, pp. 485-514 (New York, American Council on Education and Macmillan).
JAEGER, R.M. (1991) Selection of judges for standard setting, Educational Measurement: Issues and Practice, 10(2), pp. 3-6.
JAEGER, R.M., MULLIS, I.V.S., BOURQUE, M.L. & SHAKRANI, S. (1996) Setting performance standards for performance assessment: some fundamental issues, current practice, and technical dilemmas, in: Technical Issues in Large-Scale Performance Assessment, NCES 96-802 (US Department of Education, Office of Educational Research and Improvement).
KANE, M. (1994) Validating the performance standards associated with passing scores, Review of Educational Research, 64(3), pp. 425-461.
LINN, R.L., MADAUS, G.F. & PEDULLA, J.J. (1982) Minimum competency testing: cautions on the state of the art, American Journal of Education, 91, pp. 1-35.
LIVINGSTON, S.A. & ZIEKY, M.J. (1982) Passing Scores: A Manual for Setting Standards of Performance on Educational and Occupational Tests (Princeton, NJ, Educational Testing Service).
MESSICK, S. (1989) Validity, in: R.L. LINN (Ed.) Educational Measurement, 3rd edn, pp. 3-105 (New York, American Council on Education/Macmillan).
MUIJTJENS, A.M.M., HOOGENBOOM, R.J.I., VERWIJNEN, G.M. & VAN DER VLEUTEN, C.P.M. (1998) Relative or absolute standards in assessing medical knowledge using progress tests, Advances in Health Sciences Education, 3, pp. 81-87.
NEDELSKY, L. (1954) Absolute grading standards for objective tests, Educational and Psychological Measurement, 14, pp. 3-19.
NORCINI, J.J., STILLMAN, P.L., SUTNICK, A.I., REGAN, M.B., HALEY, H.L., WILLIAMS, R.G. & FRIEDMAN, M. (1993) Scoring and standard setting with standardised patients, Evaluation and the Health Professions, 16, pp. 322-332.
NUNGESTER, R.J., DILLON, G.F., SWANSON, D.B., ORR, N.A. & POWELL, R.D. (1991) Standard-setting plans for the NBME Comprehensive Part I and Part II examinations, Academic Medicine, 66(8), pp. 429-433.
SIZMUR, S. (1997) Look back in Angoff: a cautionary tale, British Educational Research Journal, 23, pp. 3-13.
WILIAM, D. (1996) Meanings and consequences in standard setting, Assessment in Education, 3, pp. 287-296.
WOLF, A. & SILVER, R. (1986) Work-based Learning: Trainee Assessment by Supervisors, MSC R&D Series No. 33 (Sheffield).
