CHAPTER 3
PRINCIPLES OF GOOD
ASSESSMENT
INTRODUCTION
Assessment is a critical component of any educational program. It involves the
selection, collection, and interpretation of information about students' performance and
program adequacy. Assessment is thus an educational process directed toward the
improvement of instruction and the assurance of students' success, and it may be
accomplished through various procedures.
Assessment is a critical and integral part of the teaching and learning process. It
cannot be isolated or treated as a stand-alone educational process. A major part of
assessment takes place in the classroom during instruction. Through assessment,
teachers gather information about and insight into their students and their learning.
Assessment is therefore embedded in every aspect of classroom thinking, planning,
and action. Its importance cannot be overlooked. Hence, as educators we need guiding
principles for executing the assessment process in teaching and learning. In this chapter
we will examine these guiding principles.
OUTCOMES
At the end of this chapter, students will be able to:
1. explain the principles of good assessment in mathematics instruction; and
2. systematically construct a good test.
3.1
1. The assessment of student learning begins with educational values: what kinds of
learning are intended for the learners, and how to go about achieving those values.
2. Assessment is most effective when it reflects an understanding of learning as
multidimensional, integrated, and revealed in performance over time.
3. Assessment works best when the programs it seeks to improve have clear and explicit
purposes; with clear and implementable goals, assessment can be more focused and useful.
4. Assessment requires attention to outcomes, but equally important are the student
experiences that lead to those outcomes; to improve outcomes, special attention must
be given to student experience along the way, including curricula, teaching and
interactions.
5. Assessment works best when it is ongoing and not episodic; improvement of teaching
and learning is best fostered when assessment entails a linked series of activities
undertaken over time.
6. Assessment fosters wider improvement when representatives from across the educational
community are involved.
7. Assessment makes a difference when it begins with issues of use and illuminates
questions that people really care about.
8. Assessment is most likely to lead to improvement when it is part of a larger set of
conditions that promote change.
9. Through assessment, educators meet responsibilities to students and to the public.
2. The mathematics is embedded in worthwhile problems that are part of the students'
world. This also means that the mathematics students are learning is engaging,
educative and authentic.
3. Methods of assessment should be such that they enable students to reveal what they
know, rather than what they do not know.
4. A balanced assessment plan should include multiple and varied opportunities or
formats for students to demonstrate and document their achievements.
5. Tasks should address all goals of the curriculum, hence covering different levels of
mathematical thinking.
6. Grading criteria should be public and consistently applied, and should include
examples of earlier grading showing exemplary work and work that is less than
exemplary.
7. The assessment process, including scoring and grading, should be open to students.
8. The quality of a task is not defined by its accessibility to objective scoring,
reliability, or validity in the traditional sense, but by its authenticity, fairness
and the extent to which it meets the above principles.
Does the assessment cover the mathematics topics that have been taught and focused on
during class activities?
Does the assessment method (for example, using a portfolio for an algebra topic) allow
the teacher to make valid decisions about his or her instruction and assessment?
Do the assessment questions allow students to demonstrate the performance that the
teacher wants to assess?
Does the assessment cover the important aspects of what the teacher wants to assess?
Are the scoring procedures clear, consistent and unbiased?
Are the directions and the wording of the mathematics items clear enough that students
will know what is expected in their answers?
Does the teacher present items of varying difficulty, with sufficient numbers of easy
questions as well as problem-solving items, in order to assess the problem-solving
performance of the students?
Figure 3.4: Some examples of questions that help to determine the
validity of mathematics classroom assessment
Besides the questions covered in Figure 3.4, can you think of other questions
that could determine the validity of classroom assessment?
Validity is concerned with this general question: to what extent will this assessment
information help me make an appropriate decision?
Validity refers to the decisions that are made from assessment information, not to the
assessment approach itself. It is not appropriate to say that assessment information is
valid unless the decisions or groups it is valid for are identified. Assessment
information valid for one decision or group of pupils is not necessarily valid for
another decision or group.
Validity is a matter of degree; it does not exist on an all-or-nothing basis. Think of
assessment validity in terms of categories: highly valid, moderately valid, and invalid.
3.1.2 Issues of Reliability
Obtaining too small a sample of behavior or of the intended learning outcomes to permit
students to show consistent or stable performance.
Figure 3.6: Some of the possible influences on assessment reliability
Reliability is not concerned with the appropriateness of the assessment information collected,
only with its consistency, stability, or typicality. Appropriateness of assessment information is
a validity concern.
Reliability does not exist on an all-or-nothing basis, but in degrees: high, moderate,
or low. Some types of assessment information are more reliable than others.
Reliability is a necessary but insufficient condition for validity. An assessment that provides
inconsistent, atypical results cannot be relied upon to provide information useful for decision
making.
Figure 3.7: Key aspects of assessment reliability
3.1.3
1. Informing students about teacher expectations and assessments before beginning
teaching and assessment.
2. Describing for pupils what they are to be assessed on before the actual assessment.
3. Being cautious about making snap judgments and labeling pupils with emotional labels
(e.g., disinterested, at-risk, slow learner) before you have spent time with them.
4. Avoiding stereotyping pupils (e.g., "Kids from that part of town are troublemakers";
"Students who dress that way have no interest in school").
5. Avoiding terms and examples that may be offensive to students of different gender,
race, religion, culture, or nationality.
6. Respecting learners' diversity or disabilities and ensuring that pupil participation
and interaction are not limited on the basis of diversity or disability.
Figure 3.8: Ethical issues and responsibilities when assessing
3.2
Tests and other assessment tools serve a variety of uses in schools, mainly in
making educational decisions concerning the teaching process, the learning process,
selection, placement, certification of mastery, aptitude scores, and attitude
tendencies of students. Validity is an important aspect of assessment. Figure 3.9
shows suggested ways to ensure validity.
Measure the objectives or learning outcomes of the course (whether they have been
achieved).
Sample the students' abilities on a majority of the objectives or learning outcomes.
Carefully match the test with the course objectives, content and teaching approaches.
Increase the sample of learning objectives, and hence the content areas and levels of
questions included in any given test; use a test blueprint or test specification table
for this purpose.
Use test methods that are appropriate for the objectives specified.
Ensure adequate security and supervision in conducting the test to avoid cheating.
Figure 3.9: Ways to ensure validity
Decide on the different types of test you are going to use (refer to Chapter 2).
Use the test blueprint, starting with the lowest cognitive level and the first content
area, and construct the test items.
1. Check on the content: decide on the proportions of content coverage according to
whether the test is a monthly test, a mid-semester test or an end-of-semester test.
Most importantly, be sure of the topics or coverage to be assessed.
2. Check on the learning objectives: decide on the proportions of learning objectives
based on the different cognitive or taxonomy levels (refer to the earlier module
SBEM3303 Kaedah Pengajaran Matematik).
3. For example, students should be able to define the radius, diameter and
circumference of a circle in their own words. Distinguish and identify the levels of
such objectives based on Bloom's taxonomy. These objectives can also be classified as
Knows (K), Understands (U), and Applies, Analyzes, Synthesizes and Evaluates (A).
4. Check to ensure the incorporation of the different contents covered and the
different levels of the learning taxonomy.
5. Discuss and agree on the test blueprint with your peers before constructing the
test items.
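As a rough sketch of the steps above, a test blueprint can be kept as a simple table of topics against cognitive levels, with row and column totals checked before item writing begins. The topic names and item counts here are illustrative only, not taken from any particular syllabus.

```python
# A minimal test-blueprint sketch: topics vs. cognitive levels.
# Topic names and item counts are illustrative only.
blueprint = {
    "Whole numbers":    {"Knowledge": 2, "Comprehension": 1, "Application": 1},
    "Fractions":        {"Knowledge": 1, "Comprehension": 2, "Application": 1},
    "Linear equations": {"Knowledge": 1, "Comprehension": 1, "Application": 2},
}

levels = ["Knowledge", "Comprehension", "Application"]

# Row totals: number of test items per topic.
topic_totals = {t: sum(cells.values()) for t, cells in blueprint.items()}

# Column totals: number of test items per cognitive level.
level_totals = {lv: sum(cells[lv] for cells in blueprint.values()) for lv in levels}

grand_total = sum(topic_totals.values())
print(topic_totals)   # items per topic
print(level_totals)   # items per level
print(grand_total)    # 12
```

Checking the totals this way makes it easy to see whether the blueprint matches the intended proportions of content coverage and cognitive levels.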
Figure 3.12: A sample test blueprint. The table lists 24 topics (Whole numbers;
Squaring of numbers and factorization; Fractions; Decimal numbers; Percentage;
Negative numbers; Measurement; Angle and parallel lines; Polygon; Perimeter and areas;
Solids and volume; Algebraic functions; Linear equations; Algebra formula; Rate and
ratio; Plane geometry and coordinates; Loci; Circle; Transformation; Statistics;
Index; Linear inequalities; Graphs and functions; Trigonometry) against the cognitive
levels Knowledge, Comprehension and Application, with the number of test items in
each cell together with row and column totals.
Figure 3.13 shows the factors to consider in determining the number of test items to use.
In your opinion, what steps were taken to create the sample test
blueprint shown in Figure 3.12?
Time available: this depends on the type of test (short test, regular test or final
examination).
Type of test items used: multiple-choice items require more time than true-false or
short-answer questions.
Keep in mind that it is desirable to give all students an opportunity to complete the test.
To interpret (gauge) students' performance, it is wise to use at least 10 test items
per learning outcome.
Figure 3.13: Factors to Consider in Determining the Number of Test Items to Use
Knowledge: 10 to 25%
Comprehension: 20 to 35%
Application: 20 to 25%
Analysis, synthesis and evaluation: 10 to 15%
1. What should test items do? What is the purpose of the test?
2. The number of test items needed for a power test or achievement test can be based on:
   - the type of item used: short-answer or essay items require a longer time, and
     therefore fewer test items;
   - the ability level of the students: a test for a slower class should be shorter
     than one for an advanced class;
   - the length and complexity of the items: more stimulus material means fewer test
     items;
   - the type or level of objective being tested: recall items require a shorter period;
   - the amount of computation involved in the test.
3. The typical student will require 30 to 45 seconds to read and answer a simple
factual MCQ or T-F item.
4. The typical student will require 75 to 100 seconds to read and answer a fairly
complex MCQ requiring problem solving.
5. Use verbs describing the behavior listed in the learning objectives: for behaviors
such as recall and comprehend, use T-F, matching or MCQ items.
6. Use verbs describing the behavior listed in the learning objectives: for behaviors
such as apply, analyze and organize, use MCQ or essay items.
Figure 3.15: Writing test items
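The timing rules of thumb above (30 to 45 seconds for a simple factual MCQ or T-F item, 75 to 100 seconds for a complex problem-solving MCQ) can be turned into a rough planning estimate. This is only a sketch: the per-item seconds are the midpoints of the quoted ranges, and the function name is my own.

```python
# Rough test-length estimate from the per-item timing rules of thumb above.
# Midpoints of the quoted ranges: simple items ~37.5 s, complex items ~87.5 s.
def estimated_minutes(n_simple, n_complex):
    seconds = n_simple * 37.5 + n_complex * 87.5
    return seconds / 60

# For example, 20 simple factual items and 10 problem-solving items:
print(round(estimated_minutes(20, 10), 1))  # about 27.1 minutes
```

Such an estimate helps check that all students will have an opportunity to complete the test within the time available.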
Figure 3.16 shows the considerations for writing test items based on level of difficulty.
1. The item difficulty is determined by dividing the number of students getting the
correct answer by the total number of students:
   - if 40% of students answered correctly, then the item difficulty is 40% or .40;
   - if 75% of students answered correctly, then the item difficulty is 75% or .75.
2. In general, the difficulty of an item should be halfway between the chance score
(the proportion of correct answers obtainable by guessing) and 100%.
3. In general, an MCQ with five choices should show approximately a .60 level of
difficulty.
4. In general, an MCQ with four choices should show approximately a .62 level of
difficulty.
5. In general, a true-false item with two choices should show approximately a .75
level of difficulty.
6. Difficult items should be passed by 30% to 40% of the students, and some easy items
by 80% to 90% of the students.
Figure 3.16: The considerations for writing test items based on level of difficulty
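The difficulty calculations above can be sketched directly. The helper names below are my own; `target_difficulty` implements the rule that an item's difficulty should lie halfway between the chance score and 100%.

```python
# Item difficulty: proportion of students answering correctly.
def item_difficulty(n_correct, n_total):
    return n_correct / n_total

# Target difficulty: halfway between the chance score and 100%.
# For an MCQ with k options, the chance score is 1/k.
def target_difficulty(chance):
    return (chance + 1.0) / 2

print(item_difficulty(40, 100))            # 0.4
print(target_difficulty(1 / 5))            # 0.6   (five-option MCQ)
print(round(target_difficulty(1 / 4), 3))  # 0.625 (four-option MCQ, quoted as ~.62)
print(target_difficulty(1 / 2))            # 0.75  (true-false)
```

The computed targets reproduce the .60, .62 and .75 figures quoted in the list above.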
Figure 3.17 shows the other considerations when writing test items.
Would you take into account all considerations for writing test items based on
level of difficulty? If not, which consideration would you omit?
Ease of scoring
Ease of cheating
Sequencing of items
Test directions
Marking schemes
3.3
RELIABILITY
Reliability is a measure of the consistency and precision with which a test measures
what it is supposed to measure. Theoretically, a reliable test should produce the same
results if administered to the same students on two separate occasions. Statistically,
splitting the test into two parts and assuming that these parts are equivalent is
acceptable: a reliable test shows a high degree of correlation between students'
performance in each half of the test.
Figure 3.18 shows ways to improve reliability.
Ensuring that questions are clear and suitable for the level of the students.
Developing a marking scheme of high quality (explicit and agreed criteria, checking of
marks, several skilled examiners).
When using less reliable test methods, increasing the number of questions, observations
or the examination time.
Figure 3.18: Ways to improve reliability
The resulting test scores are correlated, and this correlation coefficient provides a measure
of stability; that is, it indicates how stable the test results are over the given period of
time.
If the results are highly stable, those pupils who scored high on one administration of
the test will tend to score high on the other administration, and the remaining pupils
will tend to stay in their same relative positions on both administrations.
The correlation coefficient may vary from a perfect positive relationship, indicated
by 1.00, down to a zero relationship, indicated by 0.00.
Measures of stability in the .80s and .90s are commonly reported for standardized tests
of aptitude and achievement over occasions within the same year.
One important factor to keep in mind when interpreting measures of stability is the time
interval between tests.
If this time interval is short, say a day or two, the constancy of the results will be inflated
because pupils will remember some of their answers from the first test.
If the time interval is long, say about a year, the results will be influenced not only
by the instability of the testing procedure but also by actual changes in the pupils
occurring over that period of time.
In general, the longer the time interval is between test and retest, the more the results
will be influenced by changes in the pupil characteristics being measured and the smaller
the reliability coefficient will be.
The best time interval between test administrations will depend largely on the use to be
made of the results.
If, for example, college admission test scores can be submitted as part of an application
to college several years after the test was taken, then stability over several years is
quite important.
But stability over a long period of time is neither important nor desirable for a unit test in
a course designed to assess mastery of certain concepts and readiness to move on to
new material.
Thus, for some decisions we are interested in reliability coefficients based on a long
interval between test and retest, and for others, reliability coefficients based on a short
interval may be sufficient.
The important thing is to seek evidence of stability that fits the particular interpretation to
be made.
Most teachers will not find it possible to compute test-retest reliability coefficients
for their own classroom tests.
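For those who do wish to try, the test-retest computation described above is just a correlation between two administrations of the same test to the same pupils. The sketch below computes the Pearson correlation by hand so it needs no external libraries; the score data are illustrative.

```python
import math

# Pearson correlation between two administrations of the same test.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative scores for six pupils on two administrations.
first  = [55, 62, 70, 48, 80, 66]
second = [58, 60, 73, 50, 78, 69]
print(round(pearson_r(first, second), 2))  # close to 1: highly stable results
```

A coefficient in the .80s or .90s, as commonly reported for standardized tests, would indicate that pupils keep roughly the same relative positions on both administrations.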
CHAPTER 3
However, in choosing standardized tests, the stability of scores serves as one important
criterion.
The test manual should provide evidence of stability, indicating the time interval between
tests and any unusual experiences the group members might have had between testing.
Information concerning the stability of test scores also has implications for the use of
test results from school records and for the frequency of retesting.
When using any test score from permanent records, one should check the date of testing
and the stability data available to determine whether the results are still dependable. If
there is doubt and the decision is important, retesting is in order.
The two forms of the test are administered to the same group of pupils in close
succession, and the resulting test scores are correlated.
Thus, it indicates the degree to which both forms of the test are measuring the same
aspects of behavior.
The equivalent-forms method tells us nothing about the long-term stability of the pupil
characteristic being measured but, rather, reflects short-term constancy of pupil
performance and the extent to which the test represents an adequate sample of the
characteristic being measured.
In achievement testing, for example, there are thousands of questions that might be
asked in a particular test. But because of time limits and other restricting factors, only
some of the possible questions can be used.
The questions included in the test should provide an adequate sample of the possible
questions in the area.
The easiest way to estimate whether a test measures an adequate sample of the content
is to construct two forms of the test and correlate the results. A high correlation
indicates that both forms provide similar results and are therefore probably reliable
samples of the content being measured.
This method overcomes the problem of the time interval between tests.
However, its use is limited because two or more forms of the test must be made
available.
This method is the most rigorous test of reliability because it accounts for the
stability of the testing procedure, the constancy of the pupil characteristic being
measured, and the representativeness of the sample of tasks.
The test is administered to a group of learners in the usual manner and then the set
of items is divided into halves for scoring purposes.
To split the test so that equivalent halves are available, the usual procedure is to
score the even-numbered and the odd-numbered items separately.
This indicates the degree to which consistent results are obtained from the two halves
of the test.
To estimate the reliability of scores based on the full-length test, the Spearman-Brown
formula is usually applied:
Reliability of full test = (2 x reliability of half test) / (1 + reliability of half test)
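The split-half procedure and the Spearman-Brown correction above can be sketched as follows. The 0/1 item scores are illustrative, and the function names are my own; the correlation between odd- and even-item half scores is corrected up to full-test length.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per pupil."""
    odd  = [sum(row[0::2]) for row in item_scores]  # odd-numbered items
    even = [sum(row[1::2]) for row in item_scores]  # even-numbered items
    r_half = pearson_r(odd, even)
    # Spearman-Brown correction to full-length reliability:
    return 2 * r_half / (1 + r_half)

# Illustrative 0/1 item scores for four pupils on a six-item test.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
]
print(round(split_half_reliability(scores), 2))  # 0.73
```

Note how the corrected coefficient (0.73) is higher than the raw half-test correlation (about 0.58), reflecting that longer tests are more reliable.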
The split-half method is similar to the equivalent-form method as it indicates the extent
to which the sample of test items is a dependable sample of the content being measured.
A high correlation between scores of the two sets of test denotes the equivalence of the
two halves.
Split-half reliability tends to be higher than equivalent-forms reliability because,
in the split-half method, sources of inconsistency arise less often: the administration
is based on a single test.
Inconsistencies such as different forms, speed of work, fatigue, and test content are
better controlled.
This method provides a measure of internal consistency without, however, splitting the
test in half for scoring purposes.
It measures the extent to which the items within one form of the test have in common
with one another.
The Kuder-Richardson estimate can be thought of as the average of all possible
split-half coefficients for the group tested.
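The Kuder-Richardson idea can be sketched as below, using KR-20, the usual variant of the formula for items scored 0 or 1; the data are illustrative and the function name is my own.

```python
# KR-20 internal-consistency estimate for dichotomously scored (0/1) items.
def kr20(item_scores):
    n = len(item_scores)                   # number of pupils
    k = len(item_scores[0])                # number of items
    totals = [sum(row) for row in item_scores]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n  # variance of total scores
    # Sum of p*q over items, where p is the proportion answering correctly.
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var)

# Illustrative 0/1 item scores for four pupils on a six-item test.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
]
print(kr20(scores))  # 0.6
```

Because it needs only a single administration and a simple formula, this kind of estimate is easy to compute, which helps explain the method's widespread use.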
The simplicity of applying this method has led to its widespread use in determining
reliability.
However, such internal consistency measures are not appropriate for speeded tests.
For speeded tests, reliability obtained by the test-retest or equivalent-forms method
should be used.
This poses no great problem for teacher-made tests, since they are usually power tests.
For standardized tests, however, time limits are seldom so liberal that all students
manage to complete the test; thus the Kuder-Richardson method is appropriate only when
there is no evidence that speed of work is a factor.
Another limitation of internal consistency procedures is that they do not indicate the
constancy of learners' responses from one session to another. The time interval between
sessions is not taken into account, so such procedures do not indicate the extent to
which test results are generalizable over different periods of time.
3.4
PRACTICALITY
This pertains to whether the test is practical in terms of time and
resources. Can the results be interpreted accurately? Can the test be administered,
marked and graded? Does the test take too much time? Does administering and
grading the test need special resources? The following are some general considerations
when preparing a test.
Begin writing items far enough in advance that you will have time to revise them.
Match items to intended outcomes at the proper difficulty level to provide a valid measure
of instructional objectives.
Be sure each item deals with ONLY ONE IDEA OR ASPECT of the content area and not
with trivia.
Group items according to item type so students do not continuously shift response
patterns.
Be sure that each item is independent of all other items. The answer to one item
should not be required as a condition for answering the next item, nor should a hint to
one answer be unintentionally embedded in another item.
Be sure that each item has one correct or best answer on which experts would agree.
Avoid quoting directly from textual materials. Besides, taken out of context, direct quotes
from the text are often ambiguous.
The stem of the item should clearly formulate a problem. Include as much of the item as
possible in the stem, keeping the response options as short as possible. However,
include only the material needed to make the problem clear and specific.
Be concise; don't add extraneous information. Be sure that there is one and only one
correct or clearly best answer.
Include from three to five options (two to four distractors plus one correct answer) to
optimize testing for knowledge rather than encouraging guessing.
Use the option "none of the above" sparingly, and only when the keyed answer can be
classified unequivocally as right or wrong. Don't use this option when asking for a best
answer.
Avoid using the phrase "all of the above". It is usually the correct answer and makes the
item too easy for students with partial information.
Scoring is highly objective, requiring only a count of the number of correct responses.
Multiple-choice items can be written so that students must discriminate among options
that vary in degree of correctness. This allows students to select the best alternative
and avoids the absolute judgments found in T-F tests.
If not carefully written, multiple-choice questions can sometimes have more than one
defensible correct answer.
The desired method of marking true or false should be clearly explained before students
begin the test.
Construct statements that are definitely true or definitely false, without additional
qualifications. If opinion is used, attribute it to some source.
Keep true and false statements at approximately the same length, and be sure that
there are approximately equal numbers of true and false items.
Avoid using double-negative statements. They take extra time to decipher and are difficult
to interpret.
T-F questions tend to be short; hence more material can be covered than with any other
item format. Thus, T-F items tend to be used when a great deal of content has been
covered.
T-F questions take less time to construct. But avoid taking statements directly from the
text and modifying them slightly to create an item.
Scoring is easier with T-F questions. But avoid having students write "true" or "false"
or a "T" or "F". Instead, have them circle the T or F provided for each item.
T-F questions presume that the answer to the question or issue is unequivocally true or
false. It would be unfair to ask the student to guess at the teacher's criteria for
evaluating the truth of a statement.
T-F questions allow for and sometimes encourage a high degree of guessing. Generally,
longer examinations are needed to compensate for this.
Matching questions can be more efficient than multiple-choice questions because they
avoid repetition of options in measuring associations.
Rank the test papers in order from the highest to the lowest score.
Select the 10 papers with the highest total scores and the 10 papers with the lowest
total scores.
For each test item, tabulate the number of students in the upper and lower groups who
selected each alternative.
The item difficulty is then computed as:
p = (R / T) x 100
where R is the number of students who selected the correct answer and T is the total
number of students in the two groups.
Some guidelines:
0 to 25%: difficult item
25 to 75%: moderately difficult item
75 to 100%: easy item
The discrimination index is computed as:
D = (R_upper - R_lower) / n
where R_upper and R_lower are the numbers of students in the upper and lower groups who
answered the item correctly, and n is the number of students in each group. For example:
D = (10 - 4) / 10 = .60
This indicates good discriminating power.
There are three types of discrimination indices:
1. Positive discrimination index: those who did well on the overall test chose the
correct answer for a particular item more often than those who did poorly on the
overall test.
2. Negative discrimination index: those who did poorly on the overall test chose the
correct answer for a particular item more often than those who did well on the overall
test.
3. Zero discrimination index: those who did well and those who did poorly on the overall
test chose the correct answer for a particular item with equal frequency.
The corresponding ranges of the index are:
Negative discrimination: -1.00 to -0.25
Zero (negligible) discrimination: -0.25 to +0.25
Positive discrimination: +0.25 to +1.00
As a rough guide to item quality:
+0.40 and above: very good item
+0.20 to +0.39: reasonably good item
+0.10 to +0.19: marginal item, usually needing revision
Below +0.10: poor item, to be revised or discarded
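The upper/lower-group analysis above reduces to two short formulas. The function names below are my own, and `n_group` corresponds to the 10 highest- and 10 lowest-scoring papers in the procedure described earlier.

```python
# Item difficulty and discrimination from upper/lower group counts,
# as in the 10-highest / 10-lowest paper procedure described above.
def discrimination_index(r_upper, r_lower, n_group):
    """D = (R_upper - R_lower) / n, with n pupils in each group."""
    return (r_upper - r_lower) / n_group

def difficulty_percent(r_upper, r_lower, n_group):
    """p = (R / T) x 100 over the two groups combined."""
    return 100 * (r_upper + r_lower) / (2 * n_group)

# The worked example from the text: 10 upper-group and 4 lower-group
# students answered the item correctly.
print(discrimination_index(10, 4, 10))  # 0.6
print(difficulty_percent(10, 4, 10))    # 70.0
```

Running both formulas over every item of a test quickly flags items that fall into the marginal or poor bands above.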
Exercise 3.1
SUMMARY
In this chapter we have discussed matters pertaining to the principles of assessment in
mathematics instruction, mainly the construction of achievement tests. Basically, the
process is:
Instructional procedures are implemented that lead to the achievement of the
instructional objectives.
A test blueprint is drawn up to ensure that each important content and process area is
adequately sampled by the appropriate number and kind of test items.
Test items are written. Their format, number and level are determined in part by the
objectives, in part by the test blueprint, and in part by the teacher's judgment.
Test items are reviewed and, where necessary, edited or replaced by a panel of
validators.
The test is assembled and reproduced, with care taken to ensure that copies are legible.
Items that look marginal are subjected to quantitative and qualitative analysis.
Test papers are returned to students and notes are made on deficient or problematic
items.