
Testing Criteria

Ron Kok ©2004

A definition of testing

In its broadest sense, testing refers to a means of assessing or measuring the quality of something. How much of a particular element might be embedded in an ore? How elegantly can the gymnast perform on the parallel bars? How much of what the teacher taught can the student remember?

Standardized tests

Standardized tests measure someone’s performance against the average scores of larger groups of people. In other words, a standardized test has been used so frequently that we have average scores for large numbers of people: we have standards. Your individual score on a standardized test reflects how you stand in relation to other people who have taken the test. An I.Q. test, for instance, is a standardized test, and your score reveals what percentage of people scored lower than you did and what percentage scored higher. A GRE score functions in the same manner: your specific GRE score tells you how many people scored lower or higher than you did. Naturally, for purposes of admission to graduate school, you would want to rank higher on the percentile scale.
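As a rough illustration of how such a score is read, the sketch below computes the percentage of a norm group that scored below and above a given score. The function names and the sample scores are hypothetical, invented only for illustration; real standardized tests rely on much larger norm groups and published norm tables.

```python
def percent_below(score, norm_scores):
    """Percentage of the norm group that scored lower than `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


def percent_above(score, norm_scores):
    """Percentage of the norm group that scored higher than `score`."""
    above = sum(1 for s in norm_scores if s > score)
    return 100.0 * above / len(norm_scores)


# Hypothetical norm group of ten test takers (invented numbers).
norm_group = [85, 90, 95, 100, 100, 105, 110, 115, 120, 130]

my_score = 110
print(percent_below(my_score, norm_group))  # 60.0
print(percent_above(my_score, norm_group))  # 30.0
```

The same raw score would rank differently against a different norm group, which is why a standardized score only has meaning in relation to the people who have already taken the test.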

Criterion Reference Tests

Criterion reference tests measure how well someone has met specific standards of performance. The standard of performance is called a “criterion,” hence the name for these types of tests, which “refer” to such performance standards. For instance, in order to select members for the cross country team, a coach might decide that candidates have to be able to run a mile within 7 minutes. The coach has thereby set a criterion of performance that she wants candidates to meet. In order for a diamond to qualify as worthy of sale, it has to go through rigorous testing procedures that determine how many impurities, if any, are present. In order to determine whether or not students have learned a lesson, the teacher asks questions or poses problems that aim to tell her whether or not the students have met all the learning objectives.

Teacher-Made Tests

Obviously, teacher-made tests are tests designed by the teacher herself. There is not a single student who has not been subjected, for better or for worse, to teacher-made tests. Teacher-made tests are almost by definition criterion reference tests because they aim to measure how well a student has learned what has been taught. This seems obvious but is, as you’ll find out, extraordinarily difficult to do correctly and effectively.

The most common problem with teacher-made tests is that the criterion of performance may never have been clearly set. For instance, after a series of very interesting activities on Medieval cities and cathedrals, a teacher might decide to write a unit test on the materials in the text. Such a test might be valid if the teacher wanted to test the students on their understanding of the reading, but not if it was meant to assess the students’ learning for the entire unit. Another problem is that students often do not know what they are going to be tested on; in other words, they do not know what the performance criteria are.

If there is one principle you must keep in mind as a teacher, it is that testing is not a means to assess a student’s worth or stamina. It is not a means to trick students into revealing what it is they do not know. Rather, testing, from a teacher’s perspective, is a means to assess whether or not a student has met specific learning objectives. Testing can be done with the traditional multiple-choice tests, essay tests, oral exams, performance tests, or any other means, as long as the means of evaluation focus on the objectives that should have been met by the student. From this perspective, too, a student’s poor performance on a test may not necessarily indicate poor study habits on the part of the student but might also indicate (God forbid) the result of poor teaching.

Proper testing criteria.

One way of ensuring that your tests are properly designed is to consider specific criteria of proper testing. These criteria are:

1. Validity

2. Reliability

3. Comprehensiveness

4. Objectivity

5. Practicality (administrative ease)



1. Validity

A valid test is one that measures the learning objectives realistically and effectively. To test someone’s swimming ability, for instance, you would not devise a written test, but one that had to be taken in the water. The written test in this case would be an invalid test since it cannot measure someone’s performance in the water. By the same token, if you wanted to test a student’s ability to draw microorganisms from a drop of water under a microscope, you would not want to give that student an essay or multiple choice test on the uses of the microscope.

In asking yourself whether or not you have a valid test, you must ask yourself if the type of test you are using is really the best way of assessing the most important types of learning that the students should have engaged in. In other words, do you have a valid tool for the task?

Task: Read the following examples and determine whether or not each test item is valid or invalid.

A life science teacher has decided that the following three objectives are her most important learning objectives for a unit on the California chaparral biome.

1. After a slide session showing examples of different types of biomes, students will be able to identify those slides that show the visual characteristics of the California chaparral.

2. Given actual samples of different plants of the chaparral, students will be able to make generalizations about the possible methods used by plants to survive prolonged drought and heat conditions.

3. After a video about fires in the chaparral, students will be able to describe the role played by such fires in the chaparral life cycle.


On her test this teacher decided that among other questions the following were, to her, the most important ones. Please identify the items as being valid or invalid and explain why.

1. Define the term “chaparral.”

2. Please explain in your own words how different chaparral plants manage to survive prolonged drought and heat.

3. On a map of California, please identify the places where chaparral occurs.

4. Please identify by name the chaparral plants pictured below.

5. Please describe the role of fire in the life cycle of a manzanita.

If there is one generalization you could make about the issue of validity and teacher-made tests, what would it be?

2. Reliability

By a “reliable” test we mean a test whose score is a trustworthy assessment of a student’s skills. After a test, a teacher should always ask himself or herself the following question: “Is the score that this student received on this test an accurate reflection of the student’s understanding of the material, or is the score due to other factors that have nothing to do with the student’s understanding?” Many factors can influence the reliability of a test. Among them are the following:

1. Lack of instructions. Every test should have instructions for the test as a whole and for its different sections. Make sure that you tell your students how you want them to answer the test and how they should mark their answers. It helps to provide examples.

2. Clarity. If a test is difficult to read because of spelling mistakes, sloppiness, hand-written items, etc., the student might answer a question incorrectly through no fault of his or her own.

3. Ambiguity. If the wording of a question is puzzling or if the directions are unclear the student may provide an answer that is not the answer you are looking for.

4. Statistical Random Error. This is a fancy way of describing the effect of “guessing” on a test item. The more chances there are that a student can make a lucky guess, the larger the “statistical random error” on the test. What would be the percentage of error in a true/false question? What would be the percentage in a multiple choice question with 4 choices? What type of question might provide the least amount of random error?

5. Length and Variety of the Test. If the test consists of too many questions of the same type, the students will become fatigued and bored with the test. The result is often that they no longer want to think carefully about the items. Keep tests relatively short and vary the types of questions on a test. Remember that classroom tests are not tests of stamina.

6. Other Factors. What other factors might affect a student’s score and hence the “reliability” of that score?
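The effect of guessing described in item 4 can be made concrete with a little arithmetic. The sketch below is only an illustration, and it assumes that “random error” here means the chance of a correct blind guess on an item whose choices are all equally likely. The function names and the 10-item example are hypothetical.

```python
from math import comb


def guess_probability(num_choices):
    """Chance of a correct blind guess when all choices are equally likely."""
    if num_choices < 1:
        raise ValueError("an item needs at least one choice")
    return 1.0 / num_choices


def prob_at_least(num_items, num_choices, min_correct):
    """Chance of getting at least `min_correct` items right by guessing alone.

    Uses the binomial distribution and assumes every item is guessed
    independently with the same number of choices.
    """
    p = guess_probability(num_choices)
    return sum(
        comb(num_items, k) * p**k * (1 - p) ** (num_items - k)
        for k in range(min_correct, num_items + 1)
    )


print(guess_probability(2))  # true/false item: 0.5
print(guess_probability(4))  # four-choice item: 0.25

# Chance of reaching 7 out of 10 by pure guessing (hypothetical 10-item test).
print(round(prob_at_least(10, 2, 7), 4))  # true/false items
print(round(prob_at_least(10, 4, 7), 4))  # four-choice items
```

The fewer the choices, the more a score can be inflated by luck. Items that require the student to produce an answer rather than select one leave far less room for a lucky guess.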

Task: Please read the following examples and determine whether or not each test item is reliable or unreliable and explain the reasons for your response.

1. The students are given a test on a chaparral unit. Although the teacher thought that she had included some difficult questions, every single student obtained an “A.”

2. The teacher had to give students a test and on the morning of the test discovered he had no access to a typewriter or computer. He had to write the test in longhand. Although students had some difficulty reading the test, the teacher felt that the test was fair. Most students scored in the C+ to B- range.

3. Carol had a fever when she took the test. She told the teacher that she had studied hard, but she still received a “D” on the exam.

The following items found on tests are of dubious reliability. Please analyze each item and explain why these items might be unreliable.

1. The Judiciary committee’s impeachment deliberation resulted in a resolution in favor of

a. no impeachment

b. the majority voted for three articles

c. a sharp division between the two parties

d. one article cited obstruction of justice


2. America was discovered by ________

3. Which of the following is the best brief description of the novelist’s writing?

a. An approach to characterization and description that might be considered flowery and convoluted.

b. An approach to characterization and theme development that might be considered to be over-action focused.

c. An approach to characterization that is psychoanalytical.

d. An approach to characterization and character development that is focused on inner-feeling.

4. The value of π is 3.14 (true/false)

3. Comprehensiveness

A test should cover all the objectives of a unit adequately, with more emphasis on those objectives that are considered more important than others. In other words, the test should be “weighted” in terms of content. Among the more important ones should be objectives testing for higher types of thinking skills such as application, analysis, evaluation, and synthesis.
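One possible way to put such weighting into practice is to divide the number of test items among the objectives in proportion to their importance. The sketch below is a minimal illustration under assumptions: the objective names, the weights, and the function `allocate_items` are all hypothetical, and the rounding rule (leftover items go to the objectives with the largest fractional share) is just one reasonable choice.

```python
def allocate_items(weights, total_items):
    """Split `total_items` among objectives in proportion to their weights.

    `weights` maps an objective name to its relative importance.
    Leftover items after rounding down go to the objectives with the
    largest fractional share, so the counts always add up to `total_items`.
    """
    total_weight = sum(weights.values())
    exact = {name: total_items * w / total_weight for name, w in weights.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical unit with three objectives; higher-order thinking is weighted more.
unit_weights = {"recall": 1, "application": 2, "analysis": 3}
print(allocate_items(unit_weights, 30))
# {'recall': 5, 'application': 10, 'analysis': 15}
```

A plan like this also makes it easy to spot an objective that was taught but receives no items at all on the test.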

Task: Please read the following examples and determine whether or not the test item is comprehensive or non-comprehensive.

Example for analysis 1

A teacher is teaching a unit on polygons. She has written a number of objectives, among which are identifying various polygons by name, calculating the area of various polygons, and calculating the areas of polygons within polygons.

On the test, which consisted of 30 items, she included 10 problems asking students to calculate the areas of rectangles, 10 problems asking students to calculate the areas of triangles, and 10 problems asking students to calculate the areas of trapezoids.

Example for analysis 2

A teacher teaching the play Romeo and Juliet spent 1 day on a brief history of the Renaissance, 1 day on the stage under Elizabeth I, 1 day on the biography of Shakespeare, 3 days on watching a movie of the play, and 4 days discussing the play, which the students read at home.

On the test, 60% of the grade depended on the answer to the following question: Given the situation of Romeo and Juliet, can you think of any political situation today where this scenario could occur? Please write a brief scenario of such an event, using modern social groups and terms.

4. Objectivity

Whether or not a student answered an item correctly should not be a matter of personal interpretation by the grader. The more precise the criteria for the test are, the more objective the test will be.

Task: Please read the following examples, explain why they might be non-objective, and describe how you might want to change the item or situation.

1. Everyone in the United States has an equal opportunity in education. (circle one: true / false)


2. A student obtained a low grade on an essay test graded by the teacher’s assistant. Upset, the student turned to the professor who taught the class and who reread the essay test. On the second reading the professor thought that the student did indeed make some important points and he hence increased the grade.

5. Practicality

A test should be relatively easy to administer, take, and grade. These are the most practical and important characteristics to keep in mind. If the test is too cumbersome to use or too difficult to administer, it will put too much emphasis on the evaluation rather than on the teaching and learning aspects of a lesson or unit. Thus, although you must keep the criteria of objectivity, reliability, validity, and comprehensiveness in mind, you must balance these with the practical considerations of time and ease of administration.

Task: Please rank the following items from the most practical type of test to the least practical. Put a “1” in front of the item you consider to be the most practical, a “2” in front of the next most practical item, and so on.

A strictly multiple choice test.

A test consisting of 3 essay tests with detailed evaluation criteria.

A test which has been used in the past by the teacher and which has shown to be reliable and valid.

An exhaustive test that tests every objective 3 times in order to maintain reliability and validity.


A ready-made test which is highly reliable and valid but which needs to be bought and costs quite a bit of money.

Types of Assessments

[Figure: chart of assessment types. Textbook & teacher-made tests: pop quizzes, unit tests (multiple choice, essay, fill-ins, etc.), chapter tests. Norm reference tests: STAR tests, I.Q. tests. Authentic assessment: enhanced multiple choice, open-ended essays, performance assessment.]

What Various Assessments Mean

Finding out what students already know before entering your class or before starting a new unit:

Pre-tests. Achievement tests can also be used, as long as the score is treated not as final but as a starting point for the teacher.

Finding out how well students have learned (achievement tests)

In high school: California High School Exit Exam (CAHSEE)

In class: Various classroom assessment tools (next page).

Finding out whether or not kids have met the objectives for a lesson, a unit, or a curriculum:

Criterion reference tests. (Example: By the end of the lesson, the students will be able to explain in their own words the reasons why Goldilocks selected the “medium” chair.) See also Various classroom assessment tools (next page).

Finding out how much kids know or can do in comparison with “most” kids of their age level (standardized tests):

California English Language Development Tests (CELDT)

California Standards Tests (CSTs)

Characteristics of Authentic Tests*

A. Structure and Logistics

1. Are more appropriately public; involve an audience, a panel, and so on.

2. Do not rely on unrealistic and arbitrary time constraints.

3. Offer known, not secret, questions or tasks.

4. Are more like portfolios or a season of games (not one-shot).

5. Require some collaboration with others.

6. Recur—and are worth practicing for, rehearsing, and retaking.

7. Make assessment and feedback to students so central that school schedules, structures, and policies are modified to support them.

B. Intellectual Design Features

1. Are “essential”—not needlessly intrusive, arbitrary, or contrived to “shake out” a grade.

2. Are “enabling,” constructed to point the student toward more sophisticated use of the skills or knowledge.

3. Are contextualized, complex intellectual challenges, not “atomized” tasks, corresponding to isolated “outcomes.”

4. Involve the student’s own research or use of knowledge, for which “content” is a means.

5. Assess student habits and repertoires, not mere recall or plug-in skills.

6. Are representative challenges—designed to emphasize depth more than breadth.

7. Are engaging and educational.

8. Involve somewhat ambiguous (“ill-structured”) tasks or problems.

C. Grading and Scoring Standards

1. Involve criteria that assess essentials, not easily counted (but relatively unimportant) errors.

2. Are graded not on a “curve” but in reference to performance standards (criterion-referenced, not norm-referenced).

3. Involve demystified criteria of success that appear to students as inherent in successful activity.

4. Make self-assessment a part of the assessment.

5. Use a multifaceted scoring system instead of one aggregate grade.

6. Exhibit harmony with shared schoolwide aims—a standard.

D. Fairness and Equity

1. Ferret out and identify (perhaps hidden) strengths.

2. Strike a constantly examined balance between honoring achievement and native skill or fortunate prior training.

3. Minimize needless, unfair, and demoralizing comparisons.

4. Allow appropriate room for student learning styles, aptitudes, and interests.

5. Can be—should be—attempted by all students, with the test “scaffolded up,” not “dumbed down,” as necessary.

*Grant Wiggins (1989), Teaching to the Authentic Test, Educational Leadership, 45(7), p. 44.

From the Folder to the Portfolio: Developing Portfolio Skills

Level One: The folder is a place to store work.

Level Two: The pieces are dated, titles recorded; chronological; potential for showing development.

Level Three: Date, title, form (writing), genre (reading); audience, level of process development.

Level Four: Date, title, form, genre, audience, process; skills/goals; ideas to write about.

When it becomes more than a collection, it becomes a portfolio.

Product Portfolio:

Process Portfolio: Reflection/self-assessment; student and teacher input; conscious statement of growth.

Showcase Portfolio: The student's best work.

A few selected pieces (samples) scored holistically.