
Reporters:
Joelyn M. Montenegro
Orlynjoy B. Espiritu

Topic: Characteristics of a Good Test
Introduction:

A test is an instrument or systematic procedure for observing and describing one or more characteristics of students, using either a numerical scale or a classification scheme.

A test may consist of a single item or a combination of items. Regardless of the number of items in a test, every single item should possess certain characteristics. Having good items, however, does not necessarily lead to a good test, because a test as a whole is more than a mere combination of individual items. Therefore, in addition to having good items, a test should have certain characteristics or qualities.
The characteristics of a good test can be classified into two categories:

(i) Practical Criteria
(ii) Technical Criteria
Practical Criteria
1. Economy
2. Purpose
3. Acceptability
4. Adequacy
5. Usability
6. Meaningfulness of Test Scores
7. Comparability
Technical Criteria
1. Items
2. Standardization
3. Objectivity
4. Discrimination
5. Norms
6. Validity
7. Reliability
Practical Criteria of a Good Test

1. Economy

First of all, we are concerned with economy of time. Tests requiring a long time are not acceptable to students, parents, or markers. Other things being equal, shorter tests should be preferred over longer ones. At the same time, we should remember that too short a test would be lacking in reliability and validity.

Secondly, there is the question of economy in cost. A test should be within our financial resources. Standardized and reusable tests with separate answer sheets are more economical than tests which are usable only once. Testing can also be made economical by group testing in place of individual testing.
2. Purpose

A test can fulfill only the purpose for which it has been standardized. The test manual should be examined thoroughly to judge the suitability of the test for our use. If the test is constructed by the teacher, its utility depends largely upon his foresight in so planning the test and its use that the results will serve the desired purposes.

While selecting students for a course, selecting candidates for a professional course, or selecting young men as salesmen, we would require tests which have been specifically prepared for the respective purposes.
3. Acceptability

A good test is acceptable to the testees for whom it is intended. It should be acceptable to them in spite of varying circumstances and situations. A test that is too easy or too difficult will not be acceptable to any concerned group.

Its acceptability increases as people obtain desirable results from it year after year. Genuine claims made about it in the manual also mark it as acceptable. Its use should not result in objections and criticisms from various sides. Almost everybody should feel satisfied with the results obtained from it. It should appear to cover comprehensively the courses for which it is designed and nothing beyond.
4. Adequacy

Adequacy is a prerequisite to the reliability and validity of a test. We cannot assume that a comprehensive test is capable of measuring all the elements of knowledge and skill that a learner must acquire in completing a course. Our test items should widely represent all types of outcomes expected of pupils. The sampled items should yield scores representative of the pupils' achievement over the entire area covered. The test should be adequate from all angles: contents, age, grade, local emphases, expected learning outcomes, objectives, and other related factors.
5. Usability

The usability of a test depends upon a number of factors. For example, a test which can be handled adequately by the regular classroom teacher without much special briefing is better suited than a test requiring specially trained administrators.

Ease of administration may be another factor. The test should contain clear and complete instructions so that all the examinees read them and follow them equally well. The sub-tests should not be too many in number. Instructions should provide appropriate practice exercises. The layout of the test should be such that pupils have no difficulty in reading the items, in recording their answers, or in moving from one page or part to the next. The page size, length of line, size and style of type, and other mechanical features should facilitate the administration of the test.

Ease of scoring is another related factor. The results of the test should be obtainable in a simple, rapid, and routine manner. It is desirable if a test can be scored accurately by clerical workers as well. If mathematical manipulations are required to convert original raw scores into derived scores, a foolproof table of conversion should be a part of the test manual.

Ease of interpretation is another important factor. In the final analysis, the success or failure of a testing programme is determined by the ease and accuracy of interpretation. A good test is one which can easily be interpreted by an average teacher.
6. Meaningfulness of Test Scores

Generally we get a single score from a test. The single score becomes meaningful in view of the specific purpose of the test. But there are tests which yield several scores. A single score is likely to be more meaningful than several different scores. In a battery of tests, we have to specify what the overall score conveys, what scores on separate sub-tests convey, and what the various combinations of scores convey to us.
7. Comparability

A test possesses comparability when scores resulting from its administration can be interpreted in terms of a common base that has natural or accepted meaning. There are two ways by which comparability of results of standardized tests is established: (i) availability of parallel forms of the test and (ii) availability of adequate norms.

The test manual should provide adequate tables of norms for the different ages or levels and the different types of abilities it measures. Through norms, the scores of an individual can be compared with age and grade norms. Through parallel forms, individuals or groups can be compared from class to class, school to school, and year to year.
Technical Criteria of a Good Test

1. Items

Items of good quality are the first requisite of a good test. An item on which good students surpass poor ones is judged satisfactory, while one which shows no difference between good and poor students, or on which the poor group is more successful than the bright group, is not satisfactory.

An item fails to discriminate between good and poor students on account of any one of these reasons:
(i) It is so easy that everyone passes it or so hard that everyone fails it.
(ii) It is ambiguous or confusing.
(iii) It measures something different from what the test as a whole measures.

Good items automatically satisfy various criteria of a good test, namely purpose, acceptability, adequacy, usability, standardization, objectivity, validity, reliability, discrimination, etc.
2. Standardization

Standardization ensures uniformity of testing conditions through specified and fixed procedure, apparatus, and scoring. Standardization involves the exact materials employed, time limits, instructions, demonstrations, and every other detail of the testing situation.

The process of standardization is carried out through many steps:
(i) Specifying the objectives and outcomes of learning
(ii) Preparing suitable items in relation to objectives, outcomes of learning, contents, subtopics, weightage, and importance
(iii) Selecting a representative sample
(iv) Administering the test to this sample
(v) Preparing a scoring key
(vi) Working out reliability, validity, and norms

A test without standardization automatically loses most of the qualities and characteristics of a good test.
3. Objectivity

Objectivity of a test has two aspects: objectivity of items and objectivity of scoring. Objectivity of an item implies that it has the same meaning from person to person. There should not be any difference between the examiner's and the examinee's interpretation of an item. If the student takes an item in a different sense, its objectivity is considerably reduced. Words like "perhaps", "always", and "never" also harm the objectivity of an item.

Objectivity of scoring stands for the uniformity of scores in the hands of different markers. The personal judgement of the marker should not affect the scores. Variations in his mood and feelings, his attitudes and prejudices should have no bearing on the scores being awarded by him. Essay-type tests are generally very defective from this point of view.
4. Discrimination

The discriminating power of a test directly affects its reliability and validity. The test should detect or measure small differences in achievement, apart from distinguishing good students from poor ones. This is examined through item analysis, as sketched below.
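One common item-analysis technique (not prescribed by these slides, but a standard illustration) is the upper-lower group method: rank examinees by total score, take the top and bottom groups, and compute each item's discrimination index as the difference in the proportions passing. A minimal Python sketch, with invented data and a function name of my own:

    # Hypothetical sketch of an item discrimination index
    # (upper-lower group method; data below is made up).

    def discrimination_index(item_matrix, totals, item, fraction=0.27):
        """item_matrix[s][i] = 1 if student s passed item i, else 0.
        Returns the proportion passing in the top group minus the
        proportion passing in the bottom group."""
        order = sorted(range(len(totals)), key=lambda s: totals[s], reverse=True)
        n = max(1, round(len(order) * fraction))  # top/bottom 27% is conventional
        upper, lower = order[:n], order[-n:]
        p_upper = sum(item_matrix[s][item] for s in upper) / n
        p_lower = sum(item_matrix[s][item] for s in lower) / n
        return p_upper - p_lower

    answers = [[1, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0]]  # 6 students, 2 items
    totals = [sum(row) for row in answers]
    print(discrimination_index(answers, totals, item=0))

An index near +1 means the item separates good from poor students; an index near zero or negative flags the item as unsatisfactory in the sense described above.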
5. Norms

Every good test must be accompanied by the requisite tables of norms. A raw score provides only a numerical summary of a pupil's performance. Norms provide an adequate interpretation of raw scores. Norms are levels of performance on a test attained by well-defined groups of examinees. There are different types of norms, such as age norms, grade norms, percentiles, standard scores, and quotients.
6. Validity

Validity is the most important characteristic of a good test. Unless a test is valid, it serves no useful function. The measure of validity of a test enables us to judge whether the test measures the right thing for our purpose. No test can be 100 percent valid.

Validity is the extent to which a test measures what it claims to measure. It is vital for a test to be valid in order for the results to be accurately applied and interpreted.

Lee J. Cronbach: "Validity is the extent to which a test measures what it purports to measure."

Thorndike: "A measurement procedure is valid insofar as it correlates with some measurement of success in the job for which it is being used as a predictor."

Validity is the watchword, the foundation stone on which the entire superstructure of testing is based. A test designed to measure what a student has learnt in mathematics should measure his achievement in that course and nothing else. If the test is so constructed that an intelligent student can determine the correct answer without knowing the subject matter, the test measures general intelligence rather than achievement in mathematics. Moreover, the results of a mathematics test may have a high degree of validity for computational skills, a low degree of validity for mathematical reasoning, and no validity for showing the applications of mathematics in music. Validity is therefore specific; a test may be valid for one purpose but not valid for another.
Factors influencing the validity of a test:

1. Clear directions: When the directions clearly indicate to the pupil how to respond to the items, how to record the responses, etc., his answers will improve the validity of the test.

2. Language: If the vocabulary and the sentence structure used in the questions are not unnecessarily complicated, the test will be more valid. Otherwise, students might know the answer but fail to answer correctly simply because they do not understand the language of the question. For example, a test in science which uses difficult language becomes a test in reading comprehension and does not measure what it intends to measure.

3. Medium of expression: Take the case where English is the medium of instruction as well as examination; some students who know the subject matter very well fail in subjects like history or geography only because they fail to express the subject matter through English. A test will be more valid if its answers are demanded in a language suitable for the students as a medium of expression.

4. Difficulty level of items: Test items which are either too easy or too difficult will not provide discrimination among students. This works against the validity of the test.

5. Construction of test items: Poorly constructed test items adversely affect the validity of a test. Classroom tests are often so constructed that they measure primarily the knowledge objective. Important objectives like application, thinking, and skill are not covered in these tests, which consequently invalidates the results.

6. Time limit: If the time limit given in an achievement test is inadequate, the fast writer will gain an advantage over the slow writer. Instead of measuring achievement, the test will measure speed of writing. On the other hand, if ample time is allowed in a speed test, where time is the most important factor, it will invalidate the results. The time limit of a good test is specified in the light of its try-out and the process of standardization.

7. Extraneous factors: Extraneous factors have to be eliminated in order to ensure the validity of a test. In essay-type or short-answer tests, the examiner is greatly influenced by such factors as style of expression, method of organizing the subject matter, good handwriting, and coverage of vast material through brevity. In an objective-type test, the length or vagueness of instructions, the confusing or lengthy statement of an item, bad arrangement or format of the items, and the options of responses are some of the extraneous factors.
Kinds of Validity

1. Face Validity: Face validity means that the test looks valid to the examinees, the administrators of the test, and other observers. Although emphasis on face validity may not be justified, it cannot be called an unnecessary feature of a test. If the test content on the face of it appears irrelevant, inappropriate, or childish, the result will be poor cooperation on the part of the examinees regardless of the actual validity of the test.

2. Content Validity: Content validity represents the extent to which a test measures a representative sample of the subject matter. If the test measures the ability of the students in the content area taught by the teacher and tells us that the objectives laid down for that content area have been achieved, the test is said to have content validity. When students complain that the test items are "out of syllabus", they are in reality questioning the content validity of the test.

3. Construct Validity: Whenever we want a test in terms of some psychological or intellectual trait or quality, such as reasoning ability, adjustment, aspiration, or anxiety, we are concerned with construct validity. A construct represents a psychological trait which exists in every individual in some measure. For example, reasoning ability is a construct. A measure of reasoning ability implies that there is a quality called reasoning ability which accounts for performance on the test. Verifying such implications is the task of construct validity. Construct validity is the extent to which test performance can be interpreted in terms of certain psychological constructs.

There are three methods by which we can determine construct validity. Firstly, we can find it by working out the test's correlation with other tests measuring the same construct. We can establish the construct validity of our intelligence test by computing its correlation with the accepted tests by Terman, Jalota, or other authorities. Secondly, we can find the construct validity of a test by computing its correlation with the scores of known groups. When a highly intelligent group obtains high scores or a poorly intelligent group obtains low scores on our test, it possesses construct validity. Thirdly, construct validity is established when scores on tests like intelligence tests, reasoning ability tests, etc., show a regular increase with increasing age or after training and orientation. Sometimes it is established when, after a certain age, scores show resistance to change even after training and education.

4. Criterion-Related Validity: Criterion-related validity investigates the correspondence between the scores obtained from the newly developed test and the scores obtained from some independent outside criterion. Depending on the time of administration, two types exist:

Concurrent Validity: correlation of the test scores (new test) with a recognized measure taken at the same time.

Predictive Validity: comparison (correlation) of students' scores with a criterion taken at a later time.
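To make the distinction concrete, here is a minimal sketch of estimating concurrent validity (the scores and variable names are invented for illustration; it uses the Pearson correlation from Python's standard library, available in Python 3.10+):

    from statistics import correlation  # Pearson's r; Python 3.10+

    # Hypothetical scores for the same ten students, collected at the same time.
    new_test = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
    established_measure = [45, 52, 38, 60, 50, 40, 55, 48, 37, 58]

    # A high positive r suggests the new test measures much the same thing
    # as the recognized measure.
    print(round(correlation(new_test, established_measure), 2))

For predictive validity, the second list would instead hold a criterion collected later, such as end-of-course grades.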
7. Reliability

[Figure: two panels labelled "Reliable" and "Unreliable".]

A test is reliable to the extent that repeated measurements give consistent results for the individual. Reliability covers several aspects of the consistency of scores. It indicates the extent to which individual differences in scores are attributable to true differences in the characteristics being measured and the extent to which they are attributable to chance errors.

The measure of reliability characterizes the test when administered under standard conditions and given to a group similar to the normative sample. Since all types of reliability are concerned with the degree of consistency, all of them can be expressed in terms of a correlation coefficient. Adequacy and objectivity of a test represent two aspects of reliability.

Reliability is synonymous with consistency. It is the degree to which test scores for an individual test taker or group of test takers are consistent over repeated applications. No psychological test is completely consistent; however, a measurement that is unreliable is worthless.
Would you keep using these measurement tools? The consistency of test scores is critically important in determining whether a test can provide good measurement.
Factors influencing reliability:

1. Length of the test: The length of the test affects reliability directly. A test of 100 items is more reliable than one with 50 items. A longer test provides a more adequate sample of the behavior being measured, and the scores are apt to be less influenced by chance factors. Lengthening a test is limited by a number of practical considerations such as time, fatigue, boredom, and a limited stock of good items. Instead of lengthening the test four times, say, we can use four parallel forms of the test and then obtain the mean of the four scores of each candidate. The reliability of the test obtained from these mean scores will be almost the same as we would obtain by lengthening the test four times.

2. Variability of the group: If the range of the group is wide, the reliability coefficients obtained will also be high. A very restricted range provides a low coefficient. The reliability coefficient of a test administered to a sample of students from several grades is higher than that of a test given to the students of a single grade, because the former has wider variability. Even in less extreme cases, account must be taken of the variability of talent within the group. Reliabilities for age groups tend to be higher than for grade groups, because an age group usually contains a greater spread of talent than a single grade. A sample made up of children from a wide range of socio-economic levels tends to yield higher reliabilities than a very homogeneous one.

3. Ability level of subjects: Some tests have high reliability for older students and low reliability for younger ones, because older students have a better level of understanding.

4. Range of the measuring instrument: A yardstick may be reliable for differentiating lines varying in length from 40 cm to 50 cm, but the same yardstick may be unreliable if the lines differ only minutely. This is true of an evaluation tool also: a test which is reliable over a wide range may not be reliable over a narrow range.

5. Objectivity of scoring: Objective-type tests in general produce more reliable results than subjective or essay-type tests. But objectivity does not always stand for reliability; a test with high scoring objectivity may still be quite unreliable. The true-false test is highly objective, but it may be unreliable because of the element of chance or guessing in selecting the answer.

6. Scoring technique: There is a greater possibility of errors if scoring is done by hand. If tests are machine-scored, there will be fewer mistakes and reliability will be higher.

7. Difficulty of the test: Tests which are too easy or too difficult for the group tend to provide scores of low reliability, because in both cases the variability of the scores will be very low. The differences among individuals will be small, thus making the test unreliable.

8. Method of test construction: The nature and form of the test items, their difficulty, the sampling, and the nature of the standardization group influence test reliability. For example, an increase in the number of alternate-response items would increase reliability.

9. Testing conditions: Results obtained from administering the test in a classroom would not be the same as those obtained in a hall or open place. Slight changes in time limits, shifts in examinees' emotional attitude, the attitude of the examiner, the degree of motivation, cheating, etc., all influence the reliability of the test.

10. Errors in the individual: Individual differences and chance fluctuations also affect reliability. Some of the chance fluctuations affecting scores are momentary distractions, concern about family, a broken pencil or pen, or a sudden headache or stomach pain.

11. Instability of scores: In most testing situations, an individual's score is expected to reflect not only his present standing but also his standing for some time to come. Scores on intelligence tests, for example, are expected to remain relatively stable over long periods of time. If a child's IQ goes up or down by as much as 10 points over a period of two years, the test is not reliable for use.

12. Ambiguity: When questions are interpreted in different ways at different times by the same student, the test will not give consistent results.

13. Giving of options: Giving a choice of questions reduces the common base on which different individuals may be compared. Examinees may not attempt the same items on a second administration of the same test when it is presented in different ways, i.e., with a change in the order of items of the multiple-choice type. In the case of true-false items, moreover, many students suffer from a general tendency to answer "true" rather than "false".
There are four methods of evaluating the reliability of an instrument:

Split-Half Reliability: Determines how much error in a test score is due to poor test construction.
To calculate: Administer one test once, then calculate the reliability index with the Kuder-Richardson formula 20 (KR-20) or the Spearman-Brown formula.

Test-Retest Reliability: Determines how much error in a test score is due to problems with test administration (e.g., too much noise distracted the participant).
To calculate: Administer the same test to the same participants on two different occasions, then correlate the test scores of the two administrations.

Parallel-Forms Reliability: Determines how comparable two different versions of the same measure are.
To calculate: Administer the two tests to the same participants within a short period of time, then correlate the test scores of the two tests.

Inter-Rater Reliability: Determines how consistent two separate raters of the instrument are.
To calculate: Give the results from one test administration to two evaluators and correlate the two markings from the different raters.
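Each of the last three methods reduces to correlating two lists of scores. As a minimal sketch (the scores are invented; the Pearson formula is the same one worked through later in these slides), test-retest reliability could be computed like this:

    import math

    def pearson(x, y):
        """Pearson product-moment correlation between two score lists."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                        sum((b - my) ** 2 for b in y))
        return num / den

    # Hypothetical scores from the same five students on two occasions.
    first_administration = [20, 15, 18, 11, 16]
    second_administration = [19, 14, 18, 12, 15]
    print(round(pearson(first_administration, second_administration), 2))  # near 1 = consistent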
Split-Half Reliability
When you are validating a measure, you will most likely be
interested in evaluating the split-half reliability of your
instrument.
This method will tell you how consistently your measure assesses the
construct of interest.
If your measure assesses multiple constructs, split-half reliability will be
considerably lower. Therefore, separate the constructs that you are measuring
into different parts of the questionnaire and calculate the reliability separately
for each construct.
Likewise, if you get a low reliability coefficient, then your measure is probably
measuring more constructs than it is designed to measure. Revise your
measure to focus more directly on the construct of interest.
If you have dichotomous items (e.g., right-wrong answers) as you would
with multiple choice exams, the KR-20 formula is the best accepted
statistic.
If you have a Likert scale or other types of items, use the Spearman-Brown
formula.
Split-Half Reliability: KR-20

NOTE: Only use the KR-20 if each item has a right answer. Do NOT use it with a Likert scale.

Formula: rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

where:
rKR20 is the Kuder-Richardson formula 20 reliability coefficient
k is the total number of test items
Σ indicates to sum
p is the proportion of test takers who pass an item
q is the proportion of test takers who fail an item
σ² is the variance of the entire test
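A minimal Python sketch of this formula (the function name and the 0/1 matrix layout are my own; it uses the population variance for σ², which matches the worked example that follows):

    def kr20(item_matrix):
        """item_matrix[s][i] = 1 if student s answered item i correctly, else 0."""
        n_students = len(item_matrix)
        k = len(item_matrix[0])                     # number of items
        # Sum of p*q across items, where p = proportion passing, q = 1 - p.
        sum_pq = 0.0
        for i in range(k):
            p = sum(row[i] for row in item_matrix) / n_students
            sum_pq += p * (1 - p)
        # Population variance of the total scores.
        totals = [sum(row) for row in item_matrix]
        mean = sum(totals) / n_students
        variance = sum((t - mean) ** 2 for t in totals) / n_students
        return (k / (k - 1)) * (1 - sum_pq / variance)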
Split-Half Reliability: KR-20

I administered a 10-item arithmetic test to 15 children. To calculate the KR-20, I entered the data in an Excel spreadsheet. The first column lists each student. In the item columns, I marked a 1 if the student answered the item correctly and a 0 if the student answered incorrectly.

Student     Math problems: 1. 5+3, 2. 7+2, 3. 6+3, 4. 9+1, 5. 8+6, 6. 7+5, 7. 4+7, 8. 9+2, 9. 8+4, 10. 5+6
Sunday      1 1 1 1 1 1 1 1 1 1
Monday      1 0 0 1 0 0 1 1 0 1
Linda       1 0 1 0 0 1 1 1 1 0
Lois        1 0 1 1 1 0 0 1 0 0
Ayuba       0 0 0 0 0 1 1 0 1 1
Andrea      0 1 1 1 1 1 1 1 1 1
Thomas      0 1 1 1 1 1 1 1 1 1
Anna        0 0 1 1 0 1 1 0 1 0
Amos        0 1 1 1 1 1 1 1 1 1
Martha      0 0 1 1 0 1 0 1 1 1
Sabina      0 0 1 1 0 0 0 0 0 1
Augustine   1 1 0 0 0 1 0 0 1 1
Priscilla   1 1 1 1 1 1 1 1 1 1
Tunde       0 1 1 1 0 0 0 0 1 0
Daniel      0 1 1 1 1 1 1 1 1 1
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

The first value is k, the number of items. My test had 10 items, so k = 10.

Next we need to calculate p for each item, the proportion of the sample who answered each item correctly.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Item:                   1.   2.   3.   4.   5.   6.   7.   8.   9.   10.
Number of 1's           6    8    12   12   7    11   10   10   12   11
Proportion Passed (p)   0.40 0.53 0.80 0.80 0.47 0.73 0.67 0.67 0.80 0.73

To calculate the proportion of the sample who answered each item correctly, I first counted the number of 1's for each item; this gives the total number of students who answered the item correctly. Second, I divided that count by the number of students who took the test, 15 in this case.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Next we need to calculate q for each item, the proportion of the sample who answered each item incorrectly. Since every student either passed or failed each item, p + q = 1 for every item.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Item:                   1.   2.   3.   4.   5.   6.   7.   8.   9.   10.
Number of 1's           6    8    12   12   7    11   10   10   12   11
Proportion Passed (p)   0.40 0.53 0.80 0.80 0.47 0.73 0.67 0.67 0.80 0.73
Proportion Failed (q)   0.60 0.47 0.20 0.20 0.53 0.27 0.33 0.33 0.20 0.27

I calculated the proportion who failed by the formula 1 - p, i.e., 1 minus the proportion who passed the item. You will get the same answer if you count up the number of 0's for each item and then divide by 15.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Now that we have p and q for each item, the formula says that we need to multiply p by q for each item. Once we multiply p by q, we add up these values across all of the items (the symbol Σ means to sum across all values).
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Item:                   1.   2.   3.   4.   5.   6.   7.   8.   9.   10.
Proportion Passed (p)   0.40 0.53 0.80 0.80 0.47 0.73 0.67 0.67 0.80 0.73
Proportion Failed (q)   0.60 0.47 0.20 0.20 0.53 0.27 0.33 0.33 0.20 0.27
p × q                   0.24 0.25 0.16 0.16 0.25 0.20 0.22 0.22 0.16 0.20

In the p × q row, I multiplied p by q; for example, 0.40 × 0.60 = 0.24. Once we have p × q for every item, we sum these values:

Σpq = 0.24 + 0.25 + 0.16 + … + 0.20 = 2.05
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Finally, we have to calculate σ², the variance of the total test scores.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

For each student, I calculated the total exam score by counting the number of 1's in the student's row:

Student     Total Exam Score
Sunday      10
Monday      5
Linda       6
Lois        5
Ayuba       4
Andrea      9
Thomas      9
Anna        5
Amos        9
Martha      6
Sabina      3
Augustine   5
Priscilla   10
Tunde       4
Daniel      9

The variance of the Total Exam Score is the squared standard deviation. (Calculating the standard deviation was discussed in the Descriptive Research Study example, slide 34.) The standard deviation of the Total Exam Score is 2.36; taking 2.36 × 2.36 gives the variance:

σ² = 5.57
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

k = 10
Σpq = 2.05
σ² = 5.57

Now that we know all of the values in the equation, we can calculate rKR20:

rKR20 = (10 / (10 - 1)) × (1 - 2.05 / 5.57)
rKR20 = 1.11 × 0.63
rKR20 = 0.70
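As a check (not part of the original slides), feeding the same 15 × 10 answer matrix into the kr20 sketch defined earlier reproduces this result:

    # The 15 students' answers from the table above (rows in the same order).
    # Assumes the kr20() function from the earlier sketch.
    answers = [
        [1,1,1,1,1,1,1,1,1,1],  # Sunday
        [1,0,0,1,0,0,1,1,0,1],  # Monday
        [1,0,1,0,0,1,1,1,1,0],  # Linda
        [1,0,1,1,1,0,0,1,0,0],  # Lois
        [0,0,0,0,0,1,1,0,1,1],  # Ayuba
        [0,1,1,1,1,1,1,1,1,1],  # Andrea
        [0,1,1,1,1,1,1,1,1,1],  # Thomas
        [0,0,1,1,0,1,1,0,1,0],  # Anna
        [0,1,1,1,1,1,1,1,1,1],  # Amos
        [0,0,1,1,0,1,0,1,1,1],  # Martha
        [0,0,1,1,0,0,0,0,0,1],  # Sabina
        [1,1,0,0,0,1,0,0,1,1],  # Augustine
        [1,1,1,1,1,1,1,1,1,1],  # Priscilla
        [0,1,1,1,0,0,0,0,1,0],  # Tunde
        [0,1,1,1,1,1,1,1,1,1],  # Daniel
    ]
    print(round(kr20(answers), 2))  # 0.70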
Split-Half Reliability: Likert Tests

If you administer a Likert scale or another measure that does not have just one correct answer, the preferable statistic for calculating split-half reliability is coefficient alpha (otherwise called Cronbach's alpha).

However, coefficient alpha is difficult to calculate by hand. If you have access to SPSS, use coefficient alpha to calculate the reliability. If you must calculate the reliability by hand, use the Spearman-Brown formula instead. Spearman-Brown is not as accurate, but it is much easier to calculate.
Coefficient Alpha formula:

α = (k / (k - 1)) × (1 - Σσi² / σ²)

where σi² is the variance of one test item; the other variables are identical to the KR-20 formula.

Spearman-Brown formula:

rSB = 2rhh / (1 + rhh)

where rhh is the Pearson correlation of the scores on the two halves of the test.
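A minimal Python sketch of both formulas (the function names are my own; population variances are assumed, with one row per respondent):

    def cronbach_alpha(rows):
        """rows[s][i] = respondent s's rating on item i (e.g., Likert codes)."""
        n, k = len(rows), len(rows[0])

        def pvar(values):  # population variance
            m = sum(values) / len(values)
            return sum((v - m) ** 2 for v in values) / len(values)

        item_vars = sum(pvar([row[i] for row in rows]) for i in range(k))
        total_var = pvar([sum(row) for row in rows])
        return (k / (k - 1)) * (1 - item_vars / total_var)

    def spearman_brown(r_hh):
        """Step the half-test correlation up to a full-test reliability."""
        return 2 * r_hh / (1 + r_hh)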
Split-Half Reliability: Spearman-Brown Formula
To demonstrate the Spearman-Brown formula, I used the PANAS questionnaire that was administered in the Descriptive Research Study.
See the PowerPoint for the Descriptive Research Study for more information on the
measure.
The PANAS measures two constructs via Likert Scale: Positive Affect and
Negative Affect.
When we calculate reliability, we have to calculate it for each separate construct
that we measure.
The purpose of reliability is to determine how much error is present in the test score. If we
included questions for multiple constructs together, the reliability formula would assume
that the difference in constructs is error, which would give us a very low reliability
estimate.
Therefore, I first had to separate the items on the questionnaire into essentially two
separate tests: one for positive affect and one for negative affect.
The following calculations will only focus on the reliability estimate for positive affect. We
would have to do the same process separately for negative affect.
rSB = 2rhh / (1 + rhh)

These are the 10 items that measured positive affect on the PANAS. The data for each participant is the code for what they selected for each item: 1 is slightly or not at all, 2 is a little, 3 is moderately, 4 is quite a bit, and 5 is extremely. Fourteen participants took the test.

        Questionnaire Item Number
S/No    1   3   5   9   10  12  14  16  17  19
1       5   4   4   5   1   4   4   5   3   4
2       5   3   3   4   5   2   3   4   5   4
3       5   3   3   4   5   2   3   4   5   4
4       5   3   3   4   3   4   3   5   4   3
5       5   3   3   4   3   4   3   5   4   3
6       5   3   5   3   1   3   2   3   3   1
7       5   4   4   4   3   3   3   4   4   3
8       5   3   5   4   3   4   4   4   5   3
9       5   5   3   4   2   5   3   5   5   4
10      5   4   3   4   3   4   4   4   4   4
11      4   4   4   3   3   4   4   4   5   4
12      4   3   3   3   1   3   3   5   4   3
13      5   5   3   3   2   3   4   4   3   3
14      3   2   2   3   1   4   3   4   3   4

The first step is to split the questions in half. The recommended procedure is to assign every other item to one half of the test. If you simply take the first half of the items, the participants may have become tired toward the end of the questionnaire, and the reliability estimate will be artificially lowered.
rSB = 2rhh / (1 + rhh)

The first-half total was calculated by adding up the scores for items 1, 5, 10, 14, and 17. The second-half total was calculated by adding up the scores for items 3, 9, 12, 16, and 19.

S/No    1st Half Total   2nd Half Total
1       17               22
2       21               17
3       21               17
4       18               19
5       18               19
6       16               13
7       19               18
8       22               18
9       18               23
10      19               20
11      20               19
12      15               17
13      17               18
14      12               17
Split-Half Reliability: Spearman-Brown Formula

Now that we have our two halves of the test, we have to calculate the Pearson product-moment correlation between them:

rxy = Σ(X - X̄)(Y - Ȳ) / √{[Σ(X - X̄)²] × [Σ(Y - Ȳ)²]}

In our case, X = one person's score on the first half of the items, X̄ = the mean score on the first half, Y = one person's score on the second half of the items, and Ȳ = the mean score on the second half.

1 Half 2 Half
S/No Total Total
1 17 22 The first half total is
2 21 17 X.
3 21 17
4 18 19
The second half total is
5 18 19
Y.
6 16 13
7 19 18
8 22 18
9 18 23
10 19 20
11 20 19
12 15 17 We first have to
13 17 18 calculate the mean for
14 12 17 both halves.
Mea
n 18.1 18.4

This is X. This is
Y
(X X) (Y Y)
rxy =[(X X) ] [(Y Y) ] 2 2

1 Half 2 Half
S/No Total Total X-X Y-Y To get X X, we take each
1 17 22 -1.1 3.6 second persons first half total
2 21 17 2.9 -1.4 minus the average, which is
18.1. For example,
3 21 17 2.9 -1.4
21 18.1 = 2.9.
4 18 19 -0.1 0.6
5 18 19 -0.1 0.6
6 16 13 -2.1 -5.4
7 19 18 0.9 -0.4
8 22 18 3.9 -0.4
9 18 23 -0.1 4.6
10 19 20 0.9 1.6 To get Y Y, we take each
11 20 19 1.9 0.6
second half total minus the
average, which is 18.4. For
12 15 17 -3.1 -1.4
example,
13 17 18 -1.1 -0.4 18 18.4 = -0.4
14 12 17 -6.1 -1.4

Mea
n 18.1 18.4
(X X) (Y Y)
rxy =[(X X) ] [(Y Y) ] 2 2

(X X) (Y Y) = 12.66

1 Half 2 Half
S/No Total Total X-X Y-Y (X - X)(Y - Y)
Next we multiply (X X) times (Y
1 17 22 -1.1 3.6 -3.96 Y).
2 21 17 2.9 -1.4 -4.06
3 21 17 2.9 -1.4 -4.06
4 18 19 -0.1 0.6 -0.06
5 18 19 -0.1 0.6 -0.06
6 16 13 -2.1 -5.4 11.34
7 19 18 0.9 -0.4 -0.36
8 22 18 3.9 -0.4 -1.56
9 18 23 -0.1 4.6 -0.46
10 19 20 0.9 1.6 1.44
11 20 19 1.9 0.6 1.14
After we have multiplied,
12 15 17 -3.1 -1.4 4.34
we sum up the products.
13 17 18 -1.1 -0.4 0.44
14 12 17 -6.1 -1.4 8.54

Mean 18.1 18.4 Sum 12.66


(X X) (Y Y)
rxy =[(X X) ] [(Y Y) ] 2 2

[(X X)2] [(Y Y)2] = 82.72


To calculate the
S/N 1 Half 2 Half denominator, we have
o Total Total X-X YY (X - X)2 (Y - Y)2 to square (X X) and (Y
1 17 22 -1.1 3.6 1.21 12.96 Y)
2 21 17 2.9 -1.4 8.41 1.96
3 21 17 2.9 -1.4 8.41 1.96
4 18 19 -0.1 0.6 0.01 0.36
5 18 19 -0.1 0.6 0.01 0.36
6 16 13 -2.1 -5.4 4.41 29.16
7 19 18 0.9 -0.4 0.81 0.16
8 22 18 3.9 -0.4 15.21 0.16
9 18 23 -0.1 4.6 0.01 21.16 Next we sum the squares across
10 19 20 0.9 1.6 0.81 2.56
the participants. Then we multiply
the sums.
11 20 19 1.9 0.6 3.61 0.36
90.94 * 75.24 = 6842.33.
12 15 17 -3.1 -1.4 9.61 1.96 Finally, 6842.33 = 82.72.
13 17 18 -1.1 -0.4 1.21 0.16
14 12 17 -6.1 -1.4 37.21 1.96

Sum 90.94 75.24


rxy = Σ(X - X̄)(Y - Ȳ) / √{[Σ(X - X̄)²] × [Σ(Y - Ȳ)²]}

Now that we have calculated the numerator and denominator, we can calculate rxy:

rxy = 12.66 / 82.72
rxy = 0.15
rSB = 2rhh / (1 + rhh)

Now that we have calculated the Pearson correlation between our two halves (rxy = 0.15), we substitute this value for rhh and calculate rSB:

rSB = (2 × 0.15) / (1 + 0.15)
rSB = 0.3 / 1.15
rSB = 0.26

The measure did not have good reliability in my sample!
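The same result can be reproduced with the pearson and spearman_brown sketches from earlier (again a check, not part of the original slides):

    # Assumes pearson() and spearman_brown() from the earlier sketches.
    first_half = [17, 21, 21, 18, 18, 16, 19, 22, 18, 19, 20, 15, 17, 12]
    second_half = [22, 17, 17, 19, 19, 13, 18, 18, 23, 20, 19, 17, 18, 17]

    r_hh = round(pearson(first_half, second_half), 2)  # 0.15, matching the slides
    print(round(spearman_brown(r_hh), 2))              # 0.26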
How High Should Reliability Be?

A highly reliable test is always preferable to a test with lower reliability.

.80 or greater: Excellent
.70 to .80: Very Good
.60 to .70: Satisfactory
below .60: Suspect

A reliability coefficient of .80 indicates that 20% of the variability in test scores is due to measurement error.
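That rubric translates directly into code; a trivial sketch (the cut-offs and labels are the ones above, the function name is my own):

    def interpret_reliability(r):
        """Map a reliability coefficient to the qualitative labels above."""
        if r >= 0.80:
            return "Excellent"
        if r >= 0.70:
            return "Very Good"
        if r >= 0.60:
            return "Satisfactory"
        return "Suspect"

    print(interpret_reliability(0.70))  # "Very Good"
    print(interpret_reliability(0.26))  # "Suspect" (the PANAS halves example)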
Conclusion:

The present-day theory of measurement and evaluation has enabled us to formulate a comprehensive list of the qualities of a good test. But nothing can be taken as final about these qualities. A keen examiner and an inquisitive evaluator will always find that there is need and scope for improving even the best among the available tests.
"True teachers are those who use themselves as bridges over which they invite their students to cross; then, having facilitated their crossing, joyfully collapse, encouraging them to create their own."

Nikos Kazantzakis
