
Research Paper | English | Volume: 5 | Issue: 6 | June 2015 | ISSN - 2249-555X

Test Construction and Evaluation: A Brief Review

Keywords: test construction, test evaluation, item analysis

Shafaat Hussain, Assistant Professor of Communication, Madawalabu University, Bale-Robe, Ethiopia
Sumaiya Sajid, Assistant Professor of English, Falahe Ummat Girls PG College, Bhadohi, Uttar Pradesh, India
ABSTRACT Beginning from the intuitive era and passing through the scientific, we are today in the communicative era of testing. The pursuit of professionalism is evidenced by a host of standards and codes of practice which have been developed, implemented and enforced by testing organizations from all over the world. Creating professionally sound assessment requires both art and science. Engaging in fair and meaningful assessment and producing relevant data about students' achievement is an art; designing a test, formulating items and processing grades is a complete science. This review article reports a survey of test construction: its cyclic formulation process; the phases that it involves (deciding content, specifying objectives, preparing a table of specification and fixing items); and the way it is evaluated (item difficulty and item discrimination). Teaching and testing are inseparable, and in order to be professionally sound in judging students' performance, it is important to know the norms, standards and ethics of test construction and evaluation.

1. INTRODUCTION
Good testing practice has been discussed very extensively in the language testing literature, and has been approached from different perspectives by language testing researchers. A common approach to addressing this issue, for example, is to discuss how a language test should be developed, administered, and evaluated (Alderson, Clapham and Wall 1995; Li 1997; Heaton 2000; Fulcher 2010). These discussions focus primarily on good practice in each step of the testing cycle, including, for instance, test specifications, item writing, test administration, marking, reporting test results, and post hoc test data analyses. Another common approach to discussing good testing practice is to focus upon one particular dimension of language testing, to develop theoretical models about this particular dimension, and then to apply those theoretical models to language testing practice. For example, Bachman and Palmer (1996) developed a model of 'test usefulness', which, as they argued, was 'the most important consideration in designing and developing a language test'. Other examples adopting this approach are Cheng, Watanabe, and Curtis (2004), focusing on test washback, Kunnan (2000, 2004) on test fairness, Shohamy (2001a, b) on use-oriented testing and the power of tests, and McNamara and Roever (2006) on the social dimensions of language testing. Good testing practice has also been considerably documented in the standards or codes of practice which have been developed by testing or research organizations from all over the world, for example the ILTA Guidelines (2007), the ALTE Code of Practice (1994), the EALTA Guidelines (2006), the ETS Standards for Quality and Fairness (2002) and the like (Boyd and Davies 2002; Fulcher and Davidson 2007; Bachman and Palmer 2010).

Figure 1: Cyclic process of teaching and testing

2. TEST CONSTRUCTION
Ideally, effective tests share some characteristics. They are valid (providing useful information about the concepts they were designed to test), reliable (allowing consistent measurement and discriminating between different levels of performance), recognizable (instruction has prepared students for the assessment), realistic (concerning the time and effort required to complete the assignment), practical and objective. To achieve these, the teacher must draw up a test blueprint, or plan, by specifying the objectives, preparing a table of specification, allocating the test length as per the time limit, and deciding the types of items to be set (Wiggins 1998; Svinicki 1999).

Figure 2: Stages in Test construction

2.1 Deciding the Content
The test plan should include details of the test content in the specific course. Moreover, each content area should be weighted roughly in proportion to its judged importance. Usually, the weights are assigned according to the relative emphasis placed upon each topic in the textbook. The median number of pages on a given topic in the prescribed books is usually considered an index of its importance. In devising a classroom test, the advice and assistance of fellow teachers can prove to be of immense importance (Wiggins 1998; Riaz 2008).
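To make the weighting idea above concrete, the following minimal sketch allocates a fixed number of items across topics in proportion to their textbook page counts, the index of importance just mentioned. The topic names, page counts and total item count are hypothetical, and a teacher would normally adjust the rounded figures by judgment; this is an illustration rather than a prescribed procedure.

```python
# Hypothetical example: weight content areas by textbook coverage.
# Topic names, page counts and the total number of items are illustrative only.
pages_per_topic = {
    "Reading comprehension": 42,
    "Grammar": 30,
    "Vocabulary": 18,
    "Writing": 30,
}

total_items = 50
total_pages = sum(pages_per_topic.values())

# Allocate items roughly in proportion to each topic's share of the textbook.
allocation = {
    topic: round(total_items * pages / total_pages)
    for topic, pages in pages_per_topic.items()
}

for topic, items in allocation.items():
    share = pages_per_topic[topic] / total_pages
    print(f"{topic:25s} {share:5.1%} of pages -> {items} items")

# Rounding can leave the total slightly off target; the teacher adds or
# removes an item from the most important topics as needed.
print("Allocated in total:", sum(allocation.values()))
```

The resulting topic weights can then be spread across cognitive levels to form the table of specifications described in section 2.3.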


2.2 Specifying the Objectives
Each subject demands a different set of instructional objectives. For example, the major objectives of subjects like the sciences, social sciences and mathematics are knowledge, understanding, application and skill. On the other hand, the major objectives of a language course are knowledge, comprehension and expression. The knowledge objective is considered the lowest level of learning, whereas understanding and application of knowledge are considered higher levels of learning. As the basic objectives of education are concerned with the modification of human behavior, the teacher must determine measurable cognitive outcomes of instruction at the beginning of the course. The test determines the extent to which the objectives have been attained, both for the individual students and for the class as a whole. Some objectives are stated as broad, general, long-range goals, e.g., the ability to exercise the mental functions of reasoning, imagination and critical appreciation. Such educational objectives are too general to be measured by classroom tests and need to be operationally defined by the class teacher (Wiggins 1998; Riaz 2008).

2.3 Preparing Table of Specifications
A table of specifications is a two-way table that represents along one axis the content areas/topics that the teacher has taught during the specified period and, along the other axis, the cognitive level at which each is to be measured. In other words, the table of specifications highlights how much emphasis is to be given to each objective or topic. While writing the test items, it may not be possible to adhere very rigorously to the weights assigned to each cell. Thus, the weights indicated in the original table may need to be changed slightly during the course of test construction, if the teacher encounters sound reasons for such a change. For instance, the teacher may find it appropriate to modify the original test plan in view of data obtained from an experimental try-out of the new test (Wiggins 1998; Riaz 2008).

2.4 Deciding Test Length
The number of items that should constitute the final form of a test is determined by the purpose of the test or its proposed uses, and by the statistical characteristics of the items. Three important considerations in setting test length are: (i) the optimal number of items for a homogeneous test is lower than for a highly heterogeneous test; (ii) items that are meant to assess higher thought processes like logical reasoning, creativity and abstract thinking require more time than those that depend on the ability to recall important information; and (iii) the length of a test and the time required for it are related to the validity and reliability of the test. The teacher has to determine the number of items that will yield maximum validity and reliability for the particular test (Wiggins 1998; Riaz 2008).

2.5 Fixing Types of Items
Each type of exam item has its advantages and disadvantages in terms of ease of design, implementation and scoring, and in its ability to measure different aspects of students' knowledge or skills. Multiple-choice and essay items are often used in college-level assessment because they readily lend themselves to measuring higher-order thinking skills (e.g., application, justification, inference, analysis and evaluation), yet instructors often struggle to create, implement and score these items (Worthen et al. 1993; Wiggins 1998; McMillan 2001). Here, an attempt is made to examine the guidelines to be followed while designing the major types of items: true-false, gap-filling, matching, multiple-choice and essay.

2.5.1 Constructing True-False Items
While constructing true-false items, attempts should be made to avoid trivial, broad, general and negative statements. When a negative word is necessary and cannot be avoided, it should be underlined or put in italics so that students do not overlook it. Second, it is better not to include two ideas in one statement unless there is a cause-effect relationship. Third, opinions attributed to some source should not be used unless the ability to identify opinion is being specifically measured. Fourth, true and false statements should be equal in length. Fifth, there should be proportionate numbers of true and false statements. Finally, statements should be simple in language and easy to understand (Gronlund and Linn 1990; Chase and Jacobs 1992; Wiggins 1998; McMillan 2001).

2.5.2 Constructing Completion/Gap Filling Items
While constructing completion/gap-filling items, an attempt should be made to word the item so that the required answer is both brief and specific. A direct question is generally more desirable than an incomplete statement. Direct statements from textbooks should not be taken as items. If the answer is to be expressed in numerical units, the type of answer wanted should be indicated. Blanks for answers (gap-filling spaces) should be equal in length and placed in a column to the right of the question. Including too many blanks in one statement is not advisable (Gronlund and Linn 1990; Chase and Jacobs 1992; Wiggins 1998; McMillan 2001).

2.5.3 Constructing Matching Items
While constructing matching items, homogeneous material should be used within a single matching exercise. It is advisable to include an unequal number of responses and premises and to instruct the student that responses may be used once, more than once, or not at all. The longer premises should be kept in the left column and the shorter responses placed on the right. The list of responses should be arranged in logical order, with words in alphabetical order and numbers in sequence. The directions must indicate the basis for matching the responses and premises. Ambiguity should be avoided so that testing time is not wasted during the examination, and, finally, a matching exercise should be placed entirely on one page (Gronlund and Linn 1990; Chase and Jacobs 1992; Wiggins 1998; McMillan 2001).

2.5.4 Constructing Multiple-choice Items
The stem and the choices are the two parts of a multiple-choice item. A stem should be clearly worded and complete in itself, and the options provided should be as short as possible. Only the information needed to make the problem clear and specific should be placed in the stem. The stem should communicate the nature of the task to the students and present a clear problem or concept; it should provide only information that is relevant to that problem or concept, and the options (distracters) should be succinct. Avoid the use of negatives in the stem (use them only when measuring whether the respondent knows the exception to a rule or can detect errors). Most concepts can be worded in positive terms, which avoids the possibility that students will overlook terms such as "no", "not" or "least" and choose an incorrect option not because they lack knowledge of the concept but because they have misread the question. Italicizing, capitalizing, using bold-face, or underlining the negative term makes it less likely to be overlooked.
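When items are stored electronically, the caution about negatives in the stem can be checked mechanically before a paper is assembled. A minimal sketch, assuming emphasis is expressed by writing the negative word in capitals; the draft stems are hypothetical:

```python
# Flag multiple-choice stems where a negative word is not emphasized
# (here, emphasis means the word is written entirely in capitals).
NEGATIVES = {"not", "no", "least", "except"}

def unemphasized_negatives(stem):
    """Return negative words that appear in the stem without emphasis."""
    flagged = []
    for word in stem.split():
        bare = word.strip(".,;:?!()\"'")
        if bare.lower() in NEGATIVES and not bare.isupper():
            flagged.append(bare.lower())
    return flagged

draft_stems = [
    "Which of the following is not a characteristic of a good test?",
    "Which of the following is NOT a characteristic of a good test?",
    "A table of specifications relates course content to which of the following?",
]

for stem in draft_stems:
    problems = unemphasized_negatives(stem)
    if problems:
        print(f"Emphasize {problems} in: {stem}")
```

Only the first draft stem is flagged; the second already emphasizes the negative and the third contains none.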


Regarding the choices or options, make certain that the item has only one correct answer. Multiple-choice items usually have at least three incorrect options (distracters). Write the correct response with no irrelevant clues; a common mistake when designing multiple-choice questions is to write the correct option with more elaboration or detail, using more words, or using general terminology rather than technical terminology. Write the distracters to be plausible yet clearly wrong. An important, and sometimes difficult, goal is ensuring that the incorrect choices (distracters) appear possibly correct. Distracters are best created using common errors or misunderstandings about the concept being assessed, and by making them homogeneous in content and parallel in form and grammar. We should refrain from "all of the above," "none of the above," or other special distracters (use them only when an answer can be classified as unequivocally correct or incorrect). "None of the above" should be restricted to items of factual knowledge with absolute standards of correctness; it is inappropriate for questions where students are asked to select "the best" answer. "All of the above" is awkward in that many students will choose it if they can identify at least one of the other options as correct and therefore assume all of the choices are correct, thereby obtaining a correct answer based on partial knowledge of the concept/content (Gronlund and Linn 1990). We must use each alternative as the correct answer about the same number of times. Check to see whether option "a" is correct about the same number of times as option "b" or "c" or "d" across the instrument. It can be surprising to find that one has created an exam in which choice "a" is correct 90% of the time. Students quickly find such patterns and increase their chances of "correct guessing" by selecting that answer option by default (Gronlund and Linn 1990; Wiggins 1998; McMillan 2001).

2.5.5 Constructing Essay Items
Essays can tap complex thinking by requiring students to organize and integrate information, interpret information, construct arguments, give explanations, evaluate the merit of ideas, and carry out other types of reasoning. In practice, we must restrict the use of essay questions to educational outcomes that are difficult to measure using other formats. Construct the item to elicit the skills and knowledge in the educational outcomes, and write the item so that students clearly understand the specific task. Other assessment formats are better for measuring recall knowledge, but the essay is able to measure deep understanding and mastery of complex information. Once you have identified the specific skills and knowledge, you should word the question clearly and concisely so that it communicates to the students the specific task(s) you expect them to complete (e.g., state, formulate, evaluate, use the principle of, create a plan for, etc.). If the language is ambiguous, or students feel they are guessing at "what the instructor wants me to do," the ability of the item to measure the intended skill or knowledge decreases. Indicate the amount of time and effort students should spend on each essay item. When essay items are used in multiples and/or combined with other item formats, you should provide students with a general time limit or time estimate to help them structure their responses. Providing estimates of the length of written responses to each item can also help students manage their time, providing cues about the depth and breadth of information that is required to complete the item. In restricted-response items, a few paragraphs are usually sufficient to complete a task focusing on a single educational outcome.

We should stay away from giving students options as to which essay questions they will answer. A common structure in many exams is to provide students with a choice of essay items (e.g., "choose two out of the three essay questions to complete…"). Instructors, and many students, often view essay choice as a way to increase the flexibility and fairness of the exam by allowing learners to focus on those items for which they feel most prepared. However, the choice actually decreases the validity and reliability of the instrument because each student is essentially taking a different test. Creating parallel essay items (from which students choose a subset) that test the same educational objectives (skills, knowledge) is very difficult, and unless students are answering the same questions that measure the same outcomes, scoring the essay items and the inferences made about student ability are less valid. While allowing students a choice gives them the perception that they have the opportunity to do their best work, you must also recognize that choice makes it difficult to draw consistent and valid conclusions about student answers and performance. Consider using several narrowly focused items rather than one broad item. For many educational objectives aimed at higher-order reasoning skills, creating a series of essay items that elicit different aspects of students' skills and knowledge can be more efficient than attempting to create one question to capture multiple objectives. By using multiple essay items (which all students complete), you can capture a variety of skills and knowledge while also covering a greater breadth of course content (Cashin 1987; Gronlund and Linn 1990; Worthen et al. 1993; Wiggins 1998; McMillan 2001).

3. WHOLISTIC PERSPECTIVE
Different types of questions can be devised for an achievement test, for instance multiple-choice, fill-in-the-blank, true-false, matching, short answer and essay. Each type of question is constructed differently, with different principles. Instructions for each type of question must be simple and brief, and questions ought to be written in simple language. If the language is difficult or ambiguous, even a student with strong language skills and good vocabulary may answer incorrectly if his or her interpretation of the question differs from the author's intended meaning (Worthen et al. 1993; Thorndike 1997; Wiggins 1998). Test items must assess the specific ability or comprehension of content developed during the course of study (Gronlund and Linn 1990). Write the questions as you teach, so that your teaching may be aimed at significant learning outcomes. A tester has to devise questions that call for comprehension and application of knowledge and skills. Some of the questions must aim at appraising examinees' ability to analyze, synthesize and evaluate novel instances of the concepts; if the instances are the same as those used in instruction, students are only being asked to recall (knowledge level). Questions should be written in different formats, e.g., multiple-choice, completion, true-false, short answer, etc., to maintain the interest and motivation of the students. The teacher should prepare alternate forms of the test to deter cheating and to provide for make-up testing (if needed). The items should be phrased so that the content, rather than the format of the statements, determines the answer. Sometimes an item contains "specific determiners" which provide an irrelevant cue to the correct answer. For example, statements that contain terms like always, never, entirely, absolutely, and exclusively are much more likely to be false than to be true. On the other hand, such terms as may, sometimes, as a rule, and in general are much more likely to be true.
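The warning about specific determiners lends itself to a simple automated screen during test assembly. A minimal sketch, assuming draft true-false statements are available as plain strings; the word lists follow the examples given above and the draft items are hypothetical:

```python
import re

# Terms that tend to cue "false" and terms that tend to cue "true"
# in true-false items, following the examples in the text above.
ABSOLUTE_TERMS = {"always", "never", "entirely", "absolutely", "exclusively"}
QUALIFIED_TERMS = {"may", "sometimes", "as a rule", "in general"}

def specific_determiners(statement):
    """Return any specific determiners found in a true-false statement."""
    text = statement.lower()
    words = set(re.findall(r"[a-z']+", text))
    return [term for term in ABSOLUTE_TERMS | QUALIFIED_TERMS
            if (" " in term and term in text) or term in words]

# Hypothetical draft items to be screened before the test is assembled.
draft_items = [
    "Reliability always implies validity.",
    "A table of specifications may be revised after a try-out.",
    "Essay items measure recall of isolated facts.",
]

for item in draft_items:
    flags = specific_determiners(item)
    if flags:
        print(f"Revise or reword: {item!r} contains {flags}")
```

Such a screen only flags wording for review; the teacher still decides whether the determiner genuinely cues the answer.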


Besides, care should be taken to avoid double negatives, complicated sentence structures, and unusual words. The difficulty level of the items should be appropriate for the ability level of the group. Optimal difficulty for true-false items is about 75 percent, for five-option multiple-choice questions about 60 percent, and for completion items approximately 50 percent. However, difficulty is not an end in itself; the item content should be determined by the importance of the subject matter. It is desirable to place a few easy items at the beginning to motivate students, particularly those who are of below-average ability (Wiggins 1998; Halpern and Hakel 2003). The items should be devised in such a manner that different taxonomy levels are evaluated. Besides, items pertaining to a specific topic, or of a particular type, should be placed together in the test. Such a grouping facilitates scoring and evaluation, and it also helps the examinees to think about and answer items that are similar in content and format without fluctuation of attention and changes of mind-set. Directions to the examinees should be as simple, clear, and precise as possible, so that even students of below-average ability can clearly understand what they are expected to do. Scoring procedures must be clearly defined before the test is administered, and the test constructor must clearly state the optimal testing conditions for test administration. Item analysis should be carried out to make necessary changes if any ambiguity is found in the items (Gronlund and Linn 1990; Wiggins 1998; McMillan 2001).

4. TEST EVALUATION
A good test has good items, and good test making requires careful attention to the principles of item evaluation. Students often judge, after taking the exam, whether the test was fair and good; the teacher is also usually interested in how the test worked for the students.

4.1 Item Analysis
Item analysis is about how difficult an item is and how well it can discriminate between the good and the poor students. In other words, item analysis provides a numerical assessment of item difficulty and item discrimination. It provides objective, external and empirical evidence for the quality of the items. The objective of item analysis is to identify problematic or poor items, which might be confusing the respondents, might lack a clearly correct response, or might have a distracter competing with the keyed answer. Item analysis comprises item difficulty and item discrimination (Wiggins 1998; Riaz 2008).

4.1.1 Item Difficulty
Item difficulty is determined from the proportion (p) of students who answered each item correctly. Item difficulty can range from zero (no one solved the item) to one hundred percent (everyone solved it correctly). The goal is usually to have items of all difficulty levels in the test so that the test can identify poor, average and good students. However, most of the items are designed to be average in difficulty, for such items are more useful. The item analysis exercise provides the difficulty level of each item. Optimally difficult items are those that 50%-75% of students answer correctly; items are considered low to moderately difficult if p is between 70% and 85%; and items that only 30% or fewer of the students solve correctly are considered difficult. The item difficulty percentage can also be expressed as an item difficulty index in decimals, e.g., .40 for an item solved by 40% of the test-takers; this index can range from 0 to 1. Items should fall at a variety of difficulty levels in order to differentiate between good and average as well as average and poor students. Easy items are usually placed in the initial part of the test to motivate students and alleviate test anxiety. The optimal item difficulty also depends on the question type and the number of possible distracters (Wiggins 1998; Riaz 2008).

4.1.2 Item Discrimination
Another way to evaluate items is to ask who gets an item correct: the good, the average, or the weak students? Assessment of item discrimination answers this query. Item discrimination refers to the percentage difference in correct responses between the poor and the high-scoring students. In a small class of 30 students, one can administer the test items, score them, and then rank the students in terms of their overall score. Next, we separate the top 15 students and the bottom 15 into two groups: the upper and the lower group. Finally, we find how often each item was solved correctly (p) by each group; in other words, the percentage of students passing each item in each of the two groups is worked out. The discrimination (D) power of the item is then obtained by finding the difference between the percentage for the upper group and that for the lower group. The higher the difference, the greater the discrimination power of an item. An item with a discrimination of 60% or greater is considered a very good item, whereas a discrimination of less than 20% indicates low discrimination and the item needs to be revised. An item with a negative index of discrimination indicates that the poor students answer it correctly more often than the good students do; such items should be dropped from the test, and the most difficult items having negative discrimination, in particular, should be removed. A discrimination of 100% would occur if all those in the upper group answered correctly and all those in the lower group answered incorrectly; zero discrimination occurs when equal numbers in both groups answer correctly; and negative discrimination, a highly undesirable condition, occurs when more students in the lower group than in the upper group answer correctly. Items with 25% and above discrimination are considered good (Wiggins 1998; Riaz 2008).
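A minimal sketch of the difficulty and discrimination calculations described in sections 4.1.1 and 4.1.2, assuming scored responses are available as a 0/1 matrix; the class size, item count and scores below are hypothetical. The optimal-difficulty figures quoted earlier (about 75% for true-false, 60% for five-option multiple choice, 50% for completion) are consistent with one common rule of thumb, the midpoint between the chance score and 100%, which the sketch also computes.

```python
# Hypothetical scored responses: rows are students, columns are items,
# 1 = correct, 0 = incorrect. Real data would come from the marked scripts.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]

n_students = len(scores)
n_items = len(scores[0])
totals = [sum(row) for row in scores]

# Item difficulty: proportion (p) of all students answering the item correctly.
difficulty = [sum(row[i] for row in scores) / n_students for i in range(n_items)]

# Item discrimination: rank students by total score, split into upper and
# lower halves, and take the difference in p between the two groups.
ranked = [row for _, row in sorted(zip(totals, scores), key=lambda t: t[0], reverse=True)]
half = n_students // 2
upper, lower = ranked[:half], ranked[-half:]

def proportion_correct(group, item):
    return sum(row[item] for row in group) / len(group)

discrimination = [proportion_correct(upper, i) - proportion_correct(lower, i)
                  for i in range(n_items)]

for i in range(n_items):
    print(f"Item {i + 1}: p = {difficulty[i]:.2f}, D = {discrimination[i]:+.2f}")

# One common rule of thumb for the optimal p of a k-option item:
# halfway between the chance score (1/k) and 1.0.
def optimal_difficulty(k_options):
    return (1 / k_options + 1) / 2

print("Optimal p, true-false:", optimal_difficulty(2))                 # 0.75
print("Optimal p, five-option multiple choice:", optimal_difficulty(5))  # 0.6
```

With 30 students this reproduces the upper-15/lower-15 split described above; larger groups often use only the top and bottom 27% of scorers, a refinement not shown here.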

5. CONCLUSION
Tests have undergone radical changes in the past sixty years due to improvements in measurement techniques and a better understanding of learning processes. Compared with a lengthy three-hour essay-type examination, a thirty-minute objective-type paper can assess students more comprehensively, measuring not only knowledge but also comprehension and application of knowledge. Additionally, a well-prepared paper can evaluate students objectively and quickly, and a large number of students in a class is not a problem. Tests are the goal posts which act as guides and motivators for students to learn. We all know from our own experience how students prepare for examinations: they learn not only what interests them most or what is presented best, but also according to the type of paper they expect from the teacher. For this reason, a well-prepared examination paper is a guarantee of an effective teaching-learning process.



REFERENCES
[1] Alderson, JC, Clapham, C & Wall, D. 1995. Language test construction and evaluation. Cambridge: Cambridge University Press.
[2] Bachman, LF & Palmer, AS. 1996. Language testing in practice. Oxford: Oxford University Press.
[3] Bachman, LF & Palmer, AS. 2010. Language assessment in practice: developing language assessments and justifying their use in the real world. Oxford: Oxford University Press.
[4] Fulcher, G. 2010. Practical language testing. London: Hodder Education.
[5] Fulcher, G & Davidson, F. 2007. Language testing and assessment: an advanced resource book. Oxon: Routledge.
[6] Heaton, J. 2000. Writing English language tests. Beijing: Foreign Language Teaching and Research Press.
[7] Boyd, K & Davies, A. 2002. Doctor's orders for language testers: the origin and purpose of ethical codes. Language Testing, 19(3), 296-322.
[8] Brown, FG. 1983. Principles of educational and psychological testing. 3rd edition. New York: Holt, Rinehart and Winston.
[9] Cashin, WE. 1987. Improving essay tests. Manhattan: Center for Faculty Evaluation and Development.
[10] Cheng, L, Watanabe, Y & Curtis, A. 2004. Washback in language testing: research methods and contexts. London: Lawrence Erlbaum.
[11] Gronlund, NE & Linn, RL. 1990. Measurement and evaluation in teaching. 6th edition. New York: Macmillan.
[12] Halpern, DH & Hakel, MD. 2003. Applying the science of learning to the university and beyond. Change, 35(4), 37-41.
[13] Isaac, S & Michael, WB. 1990. Handbook in research and evaluation. San Diego, CA.
[14] Kunnan, AJ. 2000. Fairness and validation in language assessment. Cambridge: Cambridge University Press.
[15] Milanovic, M & Weir, C. 2004. European language testing in a global context: proceedings of the ALTE Barcelona conference. Cambridge: Cambridge University Press.
[16] Li, X. 1997. The science and art of language testing. Changsha: Hunan Education Press.
[17] McMillan, JH. 2001. Classroom assessment: principles and practice for effective instruction. Boston: Allyn and Bacon.
[18] McNamara, T & Roever, C. 2006. The social dimensions of language testing. Oxford: Blackwell Publishing.
[19] Riaz, MN. 2008. Test construction: development and standardization of psychological tests in Pakistan. Islamabad: HEC.
[20] Shohamy, E. 2001a. The power of tests: a critical perspective on the uses of language tests. Essex: Pearson Education.
[21] Shohamy, E. 2001b. Democratic assessment as an alternative. Language Testing, 18(4), 373-391.
[22] Svinicki, MD. 1999. Evaluating and grading students. Austin: University of Texas.
[23] Thorndike, RM. 1997. Measurement and evaluation in psychology and education. New Jersey: Prentice-Hall.
[24] Wiggins, GP. 1998. Educative assessment: designing assessments to inform and improve student performance. San Francisco: Jossey-Bass.
[25] Worthen, BR, Borg, WR & White, KR. 1993. Measurement and evaluation in the schools. New York: Longman.
