SUMMARY
Among the areas of work in testing, both in and out of medical education, standards
and standard-setting remain relatively unsettled. In part this is due to the arbitrary
nature of standards, but it is also a result of confusion over terminology (Norcini,
1992). One writer's absolute standard is another's criterion-referenced scoring and
both have been called content-based approaches. In contrast, there is norm-
referenced scoring, a term which is frequently used interchangeably with relative
standards.
As a consequence of this lack of consensus about terminology and the confusion
it causes, it is not possible to describe standards and approaches to standard-setting
without first defining terms. Therefore, using a previously devised framework, this
chapter begins with a definition of scores and standards and then describes the types
of score interpretation and standards (Norcini, 1992).
International Handbook of Research in Medical Education, 811-834.
G.R. Norman, C.P.M. Van der Vleuten, D.I. Newble (eds.)
2002 Dordrecht: Kluwer Academic Publishers.
Scores
A score is a number or letter that represents how well an examinee performs along
the continuum of the construct being assessed. In medicine, it is the degree of
medical correctness for a response or group of responses. A score is the numerical
answer to the question, "How good is the examinee's performance from the
perspective of the patient?"
Scores can often be based on the actual responses of the examinees, and multiple
choice questions (MCQs) are a good example of this. Since such items are written
to have one clearly correct response, the score on each item is dichotomous (0 or 1),
and performance on the entire examination can be captured by simply counting the
number right.
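As an illustration, number-right scoring can be sketched in a few lines of Python; the answer key and responses here are invented for the example:

```python
# Minimal sketch of dichotomous (number-right) MCQ scoring.
# Each item is scored 0 or 1 against a single keyed answer.

def score_mcq(responses, key):
    """Return the number-right score for one examinee."""
    return sum(1 for r, k in zip(responses, key) if r == k)

key = ["A", "C", "B", "D", "A"]          # hypothetical keyed answers
responses = ["A", "C", "D", "D", "B"]    # one examinee's responses
print(score_mcq(responses, key))         # 3 of 5 items correct
```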
In item formats that attempt to reproduce complex clinical situations with high
fidelity, scores may involve some form of weighting and they are often based on an
interpretation of the responses of the examinees. For example, in an oral
examination with live patients, scoring might use weights when there is a desire to
capture degrees of correctness. Moreover, this type of examination will likely
require interpretation on the part of the examiner because several courses of action
could lead to the same patient outcome and the connection between an examinee's
behavior and the patient's responses may not always be obvious. This requires the
examiner to interpret the examinee's actions in the broad context of the interaction
and make judgments about whether they are correct.
Regardless of whether it is weighted or unweighted, based on the actual
responses of examinees or an interpretation of those actions, a score reflects the
degree of medical correctness from the perspective of the patient.
Standards
A standard is a statement about whether an examination performance is good
enough for a particular purpose (e.g., to provide safe and effective care). It is
expressed as a special score that serves as the boundary between those who have
met the standard and those who have not.
Standards focus on the examinees' performances and judge them against a
specific social or educational construct (e.g., the competent practitioner, the student
ready for graduation). Therefore, standards are based on judgments about how
much the examinee needs to know for a certain purpose, rather than the patient
outcomes that form the basis for scoring. For example, the scoring of a standardized
patient examination should not change whether it is taken by fourth year medical
students or third year residents - the right things to do for the patient are
independent of the examinees. However, the standard should differ depending on
whether the test is being used to identify students who satisfactorily complete
medical school or third year residents who are finishing postgraduate training (i.e.,
it makes sense to require higher performance from the latter).
Scores
Scores can be interpreted from a norm-referenced or domain-referenced
perspective. They are interpreted from a norm-referenced perspective when they
provide information on how an examinee performs against others who took the test.
For instance, when a rank or percentile is reported, it tells the examinee how he or
she did relative to the other examinees. It does not provide information on how
much of the content or domain the examinee knows.
In contrast, scores are interpreted from a domain-referenced perspective when
they provide data on an examinee's performance relative to the test content. For
instance, scores are often reported as the percentage of questions an examinee
answered correctly. This provides the examinee information on how much of the
material he or she knows. It does not provide information on how the examinee
performed relative to the other examinees.
Standards
Standards are relative or absolute. Standards are relative when the decisions are
based on a comparison among the performances of examinees. For example, in the
past it was routine practice for many licensing and certifying boards in the United
States to set the pass-fail point at one standard deviation below the mean score of a
reference group. Examinees needed to perform just better than the worst 15% of the
reference group.
Standards are absolute when the pass-fail decision is based on how much the
examinee knows. For example, to pass a test an absolute standard might require
examinees to know 70% of the material or correctly answer 70 of the 100 questions.
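The two types of standard can be contrasted in a short Python sketch; the score distribution and cut points below are invented for illustration:

```python
# Contrast of a relative and an absolute standard on the same
# hypothetical score distribution (illustrative numbers only).
from statistics import mean, stdev

scores = [55, 60, 62, 65, 68, 70, 72, 75, 80, 88]

# Relative: the pass-fail point is one standard deviation below the
# mean of a reference group, as in the past practice described above.
relative_cut = mean(scores) - stdev(scores)

# Absolute: examinees must answer 70% of the material correctly,
# regardless of how the group as a whole performs.
absolute_cut = 70

print(round(relative_cut, 1))
print(sum(s >= absolute_cut for s in scores))  # count passing the absolute standard
```

Note that the relative cut would shift if a stronger or weaker reference group were used, while the absolute cut would not.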
Standard-setters
Standards are always set by people and their work needs to be sensitive to the
purpose of the test, the nature of the test material and the quality of the examinees.
The best way to ensure these sensitivities is to pick the standard-setters with care.
In a low stakes testing situation, such as an examination at the end of a course,
the standard-setter is typically the faculty member who teaches the course. This
traditional method of setting standards has some serious drawbacks. The faculty
member has a conflict of interest in that he or she is responsible for both the
education and the evaluation of the examinees. Moreover, a standard set by a single
person is likely to vary in stringency over time and course content. Despite these
problems, the faculty member represents all of the important considerations in
setting the standard, since he or she knows the purpose of the test, understands the
content, and is familiar with the examinees. In addition, this time-honored method
of standard-setting is efficient and it has credibility with students and the public.
In a high stakes testing situation, setting standards is more complex and it would
not be credible to rely on a single faculty member. Using certification as an
example, it is essential to include a number of judges in the standard-setting
exercise. They will improve the reproducibility of the standards and reduce the
effects of differences in stringency over time and test forms. Likewise, it is essential
to ensure racial, ethnic, geographical, and gender balance and to avoid the inclusion
of any participants who have a conflict of interest or who would detract from the
defensibility of the standard.
The exact mix of qualifications in the group of individuals who set standards for
a high stakes certifying examination is of considerable importance. Clearly all of
those involved in the standard-setting process should be knowledgeable in the
content area of the examination and probably certified in it. Academic appointments
are valuable because they imply that some of the standard-setters are involved in the
educational process, and so are knowledgeable about the competence of the
examinees. Engaging practitioners is also of importance since they practice the
profession and thus are experts in the content of the test. Finally, some of the
participants should be recognized as leaders within the profession, since it will lend
credibility and defensibility to the final result.
There are many thorough reviews of the advantages and disadvantages of the most
popular standard-setting methods (e.g., Berk, 1986; Meskauskas, 1976; Shepard,
1980, 1984; Livingston & Zieky, 1982) and brief descriptions of several methods
follow this section. Although these are very helpful, there is no definitive and
substantive way to make a distinction among methods because, in the end, all
standards are arbitrary. Therefore, the exact method chosen to set standards is less
important than the fact it meets several criteria designed to lend credibility to the
final result. Specifically, the method should (1) produce standards consistent with
the purpose of the test, (2) rely on informed expert judgment, (3) demonstrate due
diligence, (4) be supported by a body of research, and (5) be easy to explain and
implement.
Methods that produce standards consistent with the purpose of the test
Over the past decade, absolute standards have been preferred for most high stakes
tests of competence (Norcini, 1999). There are at least two reasons for this. First,
relative standards pass or fail examinees without regard to how much they actually
know. This is at odds with the purpose of competence testing and it lacks credibility
with patients, physicians, and students. Second, relative standards will vary over
time with the ability of the group of examinees. In certification settings, this
undermines the meaning of the test and creates "vintages" of certified specialists,
making it important to know when the test was passed in addition to knowing
whether it was passed. This is precisely what occurred through the 1980s with the
certifying examination in internal medicine (Norcini, Maihoff, Day, & Benson,
1988). The pass-fail score was set annually at one standard deviation below the
mean of a reference group. As the ability of that reference group slipped throughout
the decade, so did the standard.
Despite these concerns, there are situations in which relative standards are more
appropriate than absolute standards (Norcini, 1999). When a high stakes
examination is used to identify a certain number of the best or worst examinees,
relative standards are preferable. Tests used in admissions and placement are
examples of this type of application. Low stakes testing, such as that done at the
level of the classroom, is also appropriate for relative standards. In this situation,
test quality and difficulty vary over time, the ability of the examinees changes, and
the standard is often based on the judgment of one person. Relative standards have
the advantage of removing the effects of changes in test difficulty over time and
thus will produce decisions that are more consistent.
Regardless of which type of standard is more appropriate for a particular testing
situation, the method for setting standards should be consistent with it.
Although not routinely necessary, some
studies have actually compared the estimates of judges over groups and time
(Norcini & Shea, 1992). They found that the standards were very similar over
different groups (usually varying by only 1-2%) and for the same judges on two
occasions, 2 years apart.
Third, it is important that the method for setting standards not foster the influence
of unrelated social factors (Fitzpatrick, 1989) or non-substantive differences in the
level and type of expert specialization (Busch & Jaeger, 1990; Norcini, Shea, &
Kanya, 1988). Studies of Angoff's method reveal that it is robust in this regard.
Standards are similar over different groups of judges (Norcini, Lipner, Langdon, &
Strecker, 1987; Norcini & Shea, 1992) and work in medicine and teaching indicates
that, within broad definitions, level and nature of specialization does not matter
(Norcini, Shea, & Kanya, 1988; Busch & Jaeger, 1990). When there are concerns
about particular judges, van der Linden (1982) and Kane (1987) have proposed
methods that determine inconsistent or biased performance.
Finally, the method should be sensitive to differences in test difficulty and
content. Therefore, it is not surprising that, for Angoff's method, there is
considerable evidence of a correlation between the judgments of the standard-
setters and actual item difficulty. Likewise, Shea, Reshetar, Dawson, and Norcini
(1994) found that the standard-setters' judgments varied by the nature of the content
as well as by item difficulty.
Realistic outcomes
If a standard produces outcomes that are not realistic (i.e., broadly consistent with
the expectations of the examinees and users), it will not be viewed as credible
regardless of the standard-setters and the method they used. Given that there are
legitimate alternatives to any standard, building a case for one of them requires
evidence that it (1) is viewed as correct by stakeholders, (2) produces pass rates that
have reasonable relationships with contemporaneous markers of competence, and
(3) is related to later performance.
Gathering the opinions of stakeholders about the "correctness" of a standard
forms the most direct test of how realistic it is. Examples of this type of work with
physicians and nurses include Norcini, Shea, and Webster (1986), Fabrey and
Raymond (1987), and Orr and Nungester (1991). For instance, Norcini et al. (1986)
focused on the internal medicine certifying examination and surveyed groups of
examinees, practicing physicians, directors of training programs, nurses, and
hospital administrators. Their responses reflected an accurate understanding of the
pass rates and support for the "correctness" of the standard (78-89% of the
respondents thought the standard was "about right").
A comparison of the pass rates of various groups with a set of expectations about
performance, what Kane (1994) calls "validity checks based on external criteria",
constitutes the second form of evidence for realistic outcomes. A match between
performance and expectations will add credibility to the result. Mismatches by
themselves will not invalidate a standard, but they raise questions that need to be
addressed, and they identify a group likely to mount a challenge to the examination.
For example, if a clinical skills examination is administered at the end of medical
school, there are at least two pass rates that would be of interest. First, it is
reasonable to expect that the educational process has been effective and only a few
students would fail the examination. If the total group pass rate is low, it may reflect
a problem with the standard (or the education) and it merits additional attention.
Second, it is reasonable to expect that students who have done well in their
clerkships should also do well on the examination. Consequently, there should be a
very high pass rate among the best students based on clerkship grades and a lower
(but still high) pass rate for those students with lower clerkship grades.
Finally, there should be a relationship between a passing or failing performance
and later educational and practice outcomes. Although the types of study that would
support this notion have enormous intuitive appeal, they are very difficult to do well
(Shimberg, 1981; Kane, 1994). There are few well defined educational or practice
outcomes and measuring them poses several serious problems.
Nonetheless, an example of this type of evidence can be found in the work of
Ramsey, Carline, Inui, Larson, LoGerfo, and Wenrich (1989) who compared
certified and non-certified internists on a variety of educational and practice
measures. Certified internists had some advantages in preventive services, patient
outcomes, and measures of medical knowledge and judgment. Although this study
had some of the limitations associated with this type of research, it exemplifies an
important approach that has broad appeal and lends considerable credibility to the
result.
There are a variety of methods for setting standards and a review of all of them is
beyond the scope of this chapter. However, they fall into four broad categories:
relative standard-setting methods based on judgments about groups of test takers,
absolute standard-setting methods based on judgments about test questions,
absolute standard-setting methods based on judgments about the performance of
individual examinees, and compromise methods (Livingston & Zieky, 1982). For
each of these categories, this section of the chapter will identify the most popular
methods, outline the steps in the process, and list their advantages and
disadvantages. Livingston and Zieky (1982) provide a more thorough treatment of
these methods.
Methods
In the two most popular of the relative standard-setting methods, the experts make
judgments about the percentage of a group of test takers who are qualified to pass.
In the fixed percentage method, the judgments are made about the total group of test
takers. In the reference group method, judgments are made about a specific
subgroup.
Methods
In these methods, the standard-setters make judgments about the qualifications
(e.g., pass or fail) of several individual examinees based on a review of their
performance on the entire test. Two of the more popular of these methods are the
Contrasting-Groups method and the Up-and-Down method.
4. Select a sample of examinees in the region where the cutting score might be.
5. Give the judges the responses of one examinee to the entire test.
6. Ask the judges to decide (by consensus or majority) whether the examinee
should pass or fail.
7. If pass, choose another examinee with a lower score; if fail, choose another
examinee with a higher score.
8. Repeat steps 5 to 7 until the judges are moving in a small range of scores.
9. Some possible cutting scores (there are many) include the mean of the last 10
passing examinees and the mean of the last 10 failing examinees.
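The steps above can be sketched in Python. Here the judges' pass-fail verdicts are simulated by a hidden threshold purely for illustration; in practice a panel would review each examinee's full test performance:

```python
# Sketch of the Up-and-Down procedure (steps 4-9 above).
# `judge` stands in for the panel's consensus verdict on one examinee.

def up_and_down(examinee_scores, judge, start, rounds=30):
    """examinee_scores: scores in the region of the likely cut;
    judge: callable returning True if the examinee should pass."""
    scores = sorted(examinee_scores)
    i = scores.index(start)
    passing, failing = [], []
    for _ in range(rounds):
        s = scores[i]
        if judge(s):
            passing.append(s)
            i = max(i - 1, 0)                 # pass: move to a lower-scoring examinee
        else:
            failing.append(s)
            i = min(i + 1, len(scores) - 1)   # fail: move to a higher-scoring examinee
    # One possible cutting score (step 9): midpoint of the means of the
    # last 10 passing and the last 10 failing examinees reviewed.
    last_pass, last_fail = passing[-10:], failing[-10:]
    return (sum(last_pass) / len(last_pass) + sum(last_fail) / len(last_fail)) / 2

cut = up_and_down(list(range(50, 91)), judge=lambda s: s >= 70, start=75)
print(round(cut, 1))
```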
Methods
In these methods, the standard-setters make judgments about the difficulty and
importance of each question on the test. The two most popular of these methods are
Angoff's method (1971) and Ebel's method (1979).
3. Define the "borderline" group, a group that has a 50-50 chance of passing.
4. Read the first item.
5. Each judge estimates the proportion of the borderline group that would get it
right.
6. The ratings are recorded for all to see, discuss, and change as appropriate.
7. Repeat steps 4 to 6 for each item.
8. Calculate the passing score by averaging the estimates of all judges for each
item and summing the items (see Table 1).
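The calculation in step 8 can be sketched as follows; the judges' ratings here are hypothetical, not the figures from Table 1:

```python
# Angoff's method: each judge estimates, per item, the proportion of
# "borderline" examinees who would answer correctly. The passing score
# is the sum over items of the judges' average estimates.

ratings = {                        # item -> one estimate per judge (hypothetical)
    "item1": [0.60, 0.70, 0.65],
    "item2": [0.80, 0.85, 0.90],
    "item3": [0.40, 0.50, 0.45],
}

passing_score = sum(sum(r) / len(r) for r in ratings.values())
print(round(passing_score, 2))     # expected score of a borderline examinee
```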
Methods
As the name implies, these methods are a compromise between relative and
absolute standards. Consequently, standard-setters make judgments, both about
groups of examinees and the content of the test. The most popular of these methods
is Hofstee's method (de Gruijter, 1985).
Some of the methods developed for use with multiple choice questions can be
applied, without change, to clinical examinations. In other instances, modification
to the methods is required. This section reviews some of the methods applied to
clinical examinations.
The methods that produce relative standards and require judgments about groups of
test takers, the fixed percentage and reference group methods, can be directly
applied following the steps outlined above. They will require no additional effort on
the part of the judges.
The methods that produce absolute standards and require judgments about
individual examinees, the Contrasting-Groups and Up-and-Down methods, can also
be directly applied following the steps outlined above. The experts' judgments
should be based on the devices used to generate examinees' scores including
ratings, checklists, post-encounter paperwork, and the like. Judgments should not be
based on videotapes or direct observation, unless they are being used to generate the
scores as well. The task of the judges applying these methods is much more
complex than with MCQs and it is difficult to do well. They must retain all the
information about all the interactions of a single candidate and then repeat this task
for many different examinees.
One way to make the task more manageable is to make the judgments on a case
by case (or problem by problem) basis and then combine the results over all cases
on the test (Clauser & Clyman, 1994). When doing this, it is essential that the
judges focus on the standard-setting question (i.e., whether a performance is good
enough for a particular purpose, like graduating from medical school) not the
scoring question (i.e., whether the examinee's performance is medically correct).
The latter judgments should have been incorporated in the scoring algorithms.
Mills, Jaeger, Plake, and Hambleton (1998) have proposed three variations on the
Contrasting-Groups method. The Sorting, Analytic Judgment, and Integrated
Judgment methods are used to make broad statements about the quality of
examinees. They all require judges to sort a sample of examinees' performances
into two or more categories. In the example used in the study, students were divided
into the categories, "below basic, basic, proficient, and advanced". They then
further divided each category into three subcategories, the extremes of which were
considered boundaries. These boundary subcategories were then used to derive the
score that separates examinees into groups. The three methods differ according to
whether the judgments were made about the whole test or portions of it, and exactly
when the discussions occurred among the judges.
All of the methods presented to this point assume that the pass-fail decision is
based on a compensatory model: examinees can compensate for poor performance
on some problems or cases with good performance on others. A class of profile-
based standard-setting methods has recently been proposed to handle conjunctive
models (e.g., where examinees need to achieve minimal scores on certain critical
cases or groups of cases as well as an acceptable score overall). In the Dominant
Profile method, Putnam, Pence, and Jaeger (1995) ask judges to work together until
they reach consensus about a minimally acceptable profile of scores. In the
Judgmental Policy Capture method (Jaeger, 1995), a group of judges individually
rate several profiles of performance (e.g., a pattern of scores). Empirical methods,
such as multiple regression, hierarchical cluster analysis, or factor analysis, are used
to derive a single policy or set of pass-fail rules for the test.
These new variations are still experimental and need to be applied with care.
First, some of them require dozens of experts to have days of discussions and/or
consider a large number of performance profiles. For most practical applications, it
is unrealistic to ask experts to devote several days of their time to a process that will
produce a result that is substantially indistinguishable from the result of a briefer,
but still rigorous process.
Second, when the methods are applied to a single test, they permit the judges to
develop weights for the various exercises, which changes the blueprint of the examination.
Table 3. A sample history checklist for an SP with unstable angina. Each item on the
checklist is given a weight of 1, 2, or 3 for a total of 38 points. To derive the standard,
treat the average judgments as though they are an examinee's responses. Multiply the
judges' averages by the weights and sum the products. Add the results for all the cases
to derive the standard.
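The calculation described in this caption can be sketched for a single case; the weights and judges' average estimates below are hypothetical:

```python
# Deriving one case's standard from a weighted checklist: the judges'
# average Angoff estimates are treated as though they were an
# examinee's responses to the weighted items (hypothetical figures).

weights   = [3, 2, 2, 1, 1]             # importance weight per checklist item
estimates = [0.9, 0.8, 0.6, 0.5, 0.4]   # judges' average estimate per item

case_standard = sum(w * e for w, e in zip(weights, estimates))
print(round(case_standard, 2))
# The standards for all cases would then be added to give the test standard.
```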
Third, some of the methods, especially those that produce relatively complex
decision rules, are heavily based on the discussions and interactions that are
idiosyncratic to a particular group of experts. Consequently, it is essential to
determine if the rules the methods generate are replicated over time and by
independent groups, as are the standards resulting from the more straightforward
methods now used. Moreover, scores are typically aggregated across problems or
cases in a straightforward fashion and then a single standard is applied. Some of the
newer methods will result in conjunctive rules that make the pass/fail decision
contingent on performance on a single, relatively unreliable exercise or case.
Conjunctive rules require that each component is sufficiently reliable for decision-
making.
Finally, the use of sophisticated statistical methods to derive standards may
adversely affect the credibility of the result. They simultaneously require the user to
claim the special expertise of the standard-setters but then change their judgments
because they contain errors or do not fit particular mathematical models. Simpler
methods enhance the ability to train the content experts to apply the method and
make it easier to explain the results to examinees and other users of the results.
Table 4. An application of Ebel's method to categories of items within a case. For all
checklist items with a particular weight (e.g., all items weighted 3) the judges estimate
the proportion of borderline examinees who would respond correctly. For each weight,
average their judgments. To derive the standard, multiply the judges' averages by the
weights and sum the products. Average the cases to derive the standard.
The methods that produce absolute standards and require judgments about
individual questions, Angoff's and Ebel's methods, can also be applied directly to
cases or problems treated as items. For example, in Angoff's method the judges
would be asked, "what score on this case would you expect the borderline examinee
to get?" For Ebel's method, the cases or problems could be categorized in the same
fashion as the MCQs and then judges could be asked, "what score on this category
of cases would you expect the borderline examinee to get?"
It is also possible to collect judgments, with some modification, for individual
items within cases or problems. Tables 3 and 4 illustrate applications of Angoff's
and Ebel's method to a portion of a hypothetical standardized patient checklist.
Application of Angoff's method in this fashion is very tedious and probably
unnecessary. Ebel's method is less tedious, but it requires blanket judgments for
sets of items that are not always alike.
The methods that produce compromise standards and require judgments both about
groups of examinees and test content, such as Hofstee's method, can be directly
applied following the steps outlined above. It does not require more effort to set
standards on a clinical examination than an examination consisting of multiple
choice questions.
COMBINING TESTS
Sometimes it is useful to combine several, related tests. If they all measure the same
thing, treating them collectively will produce a very reliable result. It has the same
effect as lengthening the test.
If the tests measure somewhat different concepts, treating them collectively will
allow the development of a single score or decision that reflects a more complex
construct. Whenever tests that measure different concepts are combined to predict a
single criterion, they are called a battery. A test battery is preferable to a single test
in this case because it can produce more reliable results and it permits the inclusion
of different item formats that cannot be administered simultaneously (e.g.,
standardized patients and multiple choice questions, Anastasi, 1988).
For example, an examination for graduation from medical school might require
assessments of clinical skills and medical knowledge. A standardized patient
examination would be best to assess history taking and physical exam skills, while a
test based on multiple choice questions is best to assess medical knowledge. To
accommodate both of these formats in a single examination would be cumbersome,
it would create logistic problems, and it would force a reduction in the number of
test questions because of examinees' fatigue. Therefore, separate tests combined
after administration are preferable.
A combined score is most useful when making a pass-fail decision, because its use
promotes consistent decision making. This single score is an estimate of the test
taker's ability along the construct measured by the battery. In the previous example,
the combined score for the battery is the student's estimated achievement at the end
of medical school.
Although the combined score should be used in decision making, individual test
scores should also be reported to examinees and other stakeholders. It provides a
richer profile of the individual, thus allowing an understanding of strengths and
weaknesses that can, in turn, lead to continuing improvement of the individual and
the educational program.
The rules for combining test scores should be devised and communicated to
examinees before the tests are administered. This assures that examinees know how
to focus their study efforts and the users of the scores understand their meaning. It
also permits the development of predictions of performance on the combined test.
For example, the student's grades during clerkship rotations might be predictive of
performance on the standardized patient examination. This information, in advance
of test administration, could be used by faculty and students to design a remediation
process that would improve chances of success.
The more items on a test, the more reliable it is. However, there are practical
constraints such as the resources to develop test material and examinee fatigue.
Moreover, beyond a certain point, gains in reliability are minimal and not worth the
effort to acquire them. Therefore, it is not necessarily advantageous to combine a
large number of tests that measure the same concept. Combining tests should be
considered only when one or more of them is too short to stand reliably on its own.
For a battery that combines measures of different constructs, there is also a
practical limit on the number of tests that can reasonably be combined. In fact,
combining many more than four tests measuring different concepts becomes
unwieldy (Cronbach, 1990). Under such circumstances, it is helpful to identify only
the most fundamental concepts and focus on them. For example, it may be better to
devise a single test for knowledge of internal medicine rather than separate tests for
cardiology, gastroenterology, endocrinology, and the like.
There are two models for combining scores from different tests: compensatory and
non-compensatory. In a compensatory model, test takers are allowed to make up for
poor performance on one test in the battery with good performance on others. The
most straightforward way to combine test scores in a compensatory fashion is to
simply sum them. However, many users multiply each test score by a constant
number or weight before summing them, which is an attempt to increase the relative
importance of some of the tests in the combined score.
There are a variety of methods for generating weights. The most common is to
ask experts in the field to use their judgment in selecting the weights. A second
method is to develop the weights by selecting a criterion and applying a statistical
technique such as multiple regression. Either method is acceptable, although the
estimation of weights by experts is probably more credible. In neither instance,
however, will the weights make a substantial difference in the scores of examinees.
There is considerable work indicating that, as long as the signs of the weights
(positive or negative) are correct, their magnitude makes little difference in the
validity of the combined score (Dawes & Corrigan, 1974; Wainer, 1976).
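As a minimal illustration, a weighted compensatory combination is simply a weighted sum; the test names, scores, and weights below are invented:

```python
# Compensatory combination of battery scores: each test score is
# multiplied by its weight and the products are summed
# (hypothetical scores and weights).

scores  = {"clinical_skills": 82, "medical_knowledge": 74}
weights = {"clinical_skills": 2, "medical_knowledge": 1}

combined = sum(weights[test] * score for test, score in scores.items())
print(combined)
```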
In a non-compensatory model, test takers are required to perform to a standard on
one or more of the tests in the battery. So, for example, the clinical skills and
medical knowledge tests each have their own standard and, to graduate, a student
must pass both tests. Therefore, good performance on the clinical skills test cannot
make up for poor performance on the medical knowledge test, and vice versa.
There are any number of ways to combine tests in a non-compensatory fashion.
For example, it is possible to set standards or score the battery based on a pattern of
results. To pass, students may need to obtain very high grades on the history
component of the standardized patient examination, less well on the physical exam
component, but very well on the medical knowledge test.
Here are three different examples of passing rules:
1. "Pass" if the student scores greater than 80 on the clinical skills exam and
greater than 70 in medical knowledge, otherwise "Fail".
2. "Pass" if candidate scores greater than 150 on both tests added together,
otherwise "Fail".
3. "Pass" if candidate's total score is greater than 230, where the total score is
equal to 2 times the clinical skills score plus the medical knowledge score,
otherwise "Fail".
Example 1 is a non-compensatory model: the hurdle for clinical skills is 80 and the
hurdle for medical knowledge is 70. This is also known as the method of multiple
cutoff scores. Example 2 is a compensatory model where the candidate can make up
for poor performance on the medical knowledge test with good performance on the
clinical skills examination. Example 3 is also a compensatory model but here the
clinical skills exam has twice the weight of the medical knowledge examination.
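The three example rules can be written directly as functions, where `cs` and `mk` stand for hypothetical clinical skills and medical knowledge scores:

```python
# The three passing rules above, expressed as functions.

def rule1(cs, mk):            # non-compensatory: two separate hurdles
    return "Pass" if cs > 80 and mk > 70 else "Fail"

def rule2(cs, mk):            # compensatory: simple sum against one cut
    return "Pass" if cs + mk > 150 else "Fail"

def rule3(cs, mk):            # compensatory: clinical skills weighted double
    return "Pass" if 2 * cs + mk > 230 else "Fail"

# A candidate strong in clinical skills but weak in medical knowledge
# fails the multiple-cutoff rule yet passes both compensatory rules.
print(rule1(85, 68), rule2(85, 68), rule3(85, 68))
```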
Simply doubling the weight is unlikely to have the desired effect and it would be
more useful to actually double the length of the standardized patient examination.
Clearly, a mixed compensatory and non-compensatory model can also be
developed. For example, a non-compensatory model may be used as a first stage
and a compensatory model is used as a second stage. Multiple cutoffs are first
applied to produce a field of candidates eligible for stage two scoring. In stage two,
a weighted score could be developed and one of the standard-setting methods
described above could be applied.
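This two-stage logic can be sketched as follows; the cutoffs, weights, and standard are invented for the example:

```python
# Sketch of a mixed model: multiple cutoffs first (non-compensatory
# stage), then a weighted compensatory score for the remaining
# candidates (all figures hypothetical).

def mixed_decision(cs, mk):
    # Stage one: failing either hurdle ends the process.
    if cs <= 60 or mk <= 60:
        return "Fail"
    # Stage two: a weighted compensatory score against a single standard.
    return "Pass" if 2 * cs + mk > 230 else "Fail"

print(mixed_decision(90, 40))   # eliminated at stage one
print(mixed_decision(75, 82))   # clears both hurdles, then 232 > 230
```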
REFERENCES