
Reporters:
Joelyn M. Montenegro
Orlynjoy B. Espiritu

Topic: Characteristics of a Good Test
Introduction:

A test is an instrument or systematic procedure for observing and describing one or more characteristics of students, using either a numerical scale or a classification scheme.

A test may consist of a single item or a combination of items. Regardless of the number of items in a test, every single item should possess certain characteristics. Having good items, however, does not necessarily lead to a good test, because a test as a whole is more than a mere combination of individual items. Therefore, in addition to having good items, a test should have certain characteristics or qualities.
The characteristics of a good test can be classified into two categories:

(i) Practical Criteria
(ii) Technical Criteria
Practical Criteria
1. Economy
2. Purpose
3. Acceptability
4. Adequacy
5. Usability
6. Meaningfulness of Test Scores
7. Comparability
Technical Criteria
1. Items
2. Standardization
3. Objectivity
4. Discrimination
5. Norms
6. Validity
7. Reliability
Practical Criteria of a Good Test

1. Economy

First of all, we are concerned with economy of time. Tests requiring a long time are not acceptable to students, parents, or markers. Other things being equal, shorter tests should be preferred over longer ones. At the same time, we should remember that too short a test would be lacking in reliability and validity.

Secondly, there is the question of economy in cost. A test should be within our financial resources. Standardized and reusable tests with separate answer sheets are more economical than tests which are usable only once. Testing can also be made economical by group testing in place of individual testing.
2. Purpose

A test can fulfill only the purpose for which it has been standardized. The test manual should be examined thoroughly to judge the suitability of the test for our use. If the test is constructed by the teacher, its utility depends largely upon his foresight in so planning the test and its use that the results will serve the desired purposes.

While selecting students for a course, selecting candidates for a professional course, or selecting young men as salesmen, we would require tests which have been specifically prepared for the respective purposes.
3. Acceptability

A good test is acceptable to the testees for whom it is intended. It should be acceptable to them in spite of varying circumstances and situations. A test that is too easy or too difficult will not be acceptable to any concerned group.

Its acceptability increases as people obtain desirable results from it year after year. Genuine claims made about it in the manual also mark it as acceptable. Its use should not result in objections and criticisms from various sides. Almost everybody should feel satisfied with the results obtained from it. It should appear to cover comprehensively the courses for which it is designed and nothing beyond.
4. Adequacy

Adequacy is a prerequisite to the reliability and validity of a test. We cannot assume that a comprehensive test is capable of measuring all the elements of knowledge and skill that a learner must acquire in completing a course. Our test items should widely represent all types of outcomes expected of pupils. The sampled items should yield scores representative of the pupils' achievement over the entire area covered. The test should be adequate from all angles: contents, age, grade, local emphases, expected learning outcomes, objectives, and other related factors.
5. Usability

The usability of a test depends upon a number of factors. For example, a test which can be handled adequately by the regular classroom teacher without much special briefing is better suited than a test requiring specially trained administrators.

Ease of administration may be another factor. The test should contain clear and complete instructions so that all the examinees read them and follow them equally well. The sub-tests should not be too many in number. Instructions should provide appropriate practice exercises. The layout of the test should be such that pupils have no difficulty in reading the items, in recording their answers, or in moving from one page or part to the next. The page size, length of line, size and style of type, and other mechanical features should facilitate the administration of the test.

Ease of scoring is another related factor. The results of the test should be obtainable in a simple, rapid, and routine manner. It is desirable if a test can be scored accurately by clerical workers as well. If mathematical manipulations are required to convert original raw scores into derived scores, a foolproof table of conversion should be a part of the test manual.

Ease of interpretation is another important factor. In the final analysis, the success or failure of a testing programme is determined by the ease and accuracy of interpretation. A good test is one which can easily be interpreted by an average teacher.
6. Meaningfulness of Test Scores

Generally we get a single score from a test. The single score becomes meaningful in view of the specific purpose of the test. But there are tests which yield several scores. A single score is likely to be more meaningful than several different scores. In a battery of tests, we have to specify what the overall score conveys, what scores on separate sub-tests convey, and what the various combinations of scores convey to us.
7. Comparability

A test possesses comparability when scores resulting from its administration can be interpreted in terms of a common base that has natural or accepted meaning. There are two ways by which comparability of results of standardized tests is established: (i) availability of parallel forms of the test and (ii) availability of adequate norms.

The test manual should provide adequate tables of norms for the different ages or levels and the different types of abilities it measures. Through norms, the scores of an individual can be compared with age and grade norms. Through parallel forms, individuals or groups can be compared from class to class, school to school, and year to year.
Technical Criteria of a Good Test

1. Items

Items of good quality are the first requisite of a good test. An item on which good students surpass poor ones is judged satisfactory, while one which shows no difference between good and poor students, or on which the poor group is more successful than the bright group, is not satisfactory.

An item fails to discriminate between good and poor students on account of any one of these reasons:
(i) It is so easy that everyone passes it or so hard that everyone fails it.
(ii) It is ambiguous or confusing.
(iii) It measures something different from what the test as a whole measures.

Good items automatically satisfy various criteria of a good test, namely purpose, acceptability, adequacy, usability, standardization, objectivity, validity, reliability, discrimination, etc.
2. Standardization

Standardization ensures uniformity of testing conditions through specified and fixed procedure, apparatus, and scoring. Standardization involves the exact materials employed, time limits, instructions, demonstrations, and every other detail of the testing situation.

The process of standardization is carried out through many steps:
(i) Specifying the objectives and outcomes of learning
(ii) Preparing suitable items in relation to objectives, outcomes of learning, contents, subtopics, weightage, and importance
(iii) Selecting a representative sample
(iv) Administering the test to this sample
(v) Preparing a scoring key
(vi) Working out reliability, validity, and norms

A test without standardization automatically loses most of the qualities and characteristics of a good test.
3. Objectivity

Objectivity of a test has two aspects: objectivity of items and objectivity of scoring. Objectivity of an item implies that it has the same meaning from person to person. There should not be any difference between the examiner's and the examinee's interpretation of an item. If the student takes an item in a different sense, its objectivity is considerably reduced. Words like "perhaps", "always", and "never" also harm the objectivity of an item.

Objectivity of scoring stands for the uniformity of scores in the hands of different markers. The personal judgement of the marker should not affect the scores. Variations in his mood and feelings, his attitudes and prejudices should have no bearing on the scores being awarded by him. Essay-type tests are generally very defective from this point of view.
4. Discrimination

The discriminating power of a test directly affects its reliability and validity. The test should detect or measure small differences in achievement, apart from distinguishing good students from poor ones. This is examined through item analysis, as sketched below.
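One common item-analysis technique (not prescribed by these slides, but a standard illustration) is the upper-lower group method: rank examinees by total score, take the top and bottom groups, and compute each item's discrimination index as the difference in the proportions passing. A minimal Python sketch, with invented data and a function name of my own:

    # Hypothetical sketch of an item discrimination index
    # (upper-lower group method; data below is made up).

    def discrimination_index(item_matrix, totals, item, fraction=0.27):
        """item_matrix[s][i] = 1 if student s passed item i, else 0.
        Returns the proportion passing in the top group minus the
        proportion passing in the bottom group."""
        order = sorted(range(len(totals)), key=lambda s: totals[s], reverse=True)
        n = max(1, round(len(order) * fraction))  # top/bottom 27% is conventional
        upper, lower = order[:n], order[-n:]
        p_upper = sum(item_matrix[s][item] for s in upper) / n
        p_lower = sum(item_matrix[s][item] for s in lower) / n
        return p_upper - p_lower

    answers = [[1, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0]]  # 6 students, 2 items
    totals = [sum(row) for row in answers]
    print(discrimination_index(answers, totals, item=0))

An index near +1 means the item separates good from poor students; an index near zero or negative flags the item as unsatisfactory in the sense described above.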
5. Norms

Every good test must be accompanied by the requisite tables of norms. A raw score provides only a numerical summary of a pupil's performance. Norms provide an adequate interpretation of raw scores. Norms are levels of performance on a test attained by well-defined groups of examinees. There are different types of norms, such as age norms, grade norms, percentiles, standard scores, and quotients.
6. Validity

Validity is the most important characteristic of a good test. Unless a test is valid, it serves no useful function. The measure of validity of a test enables us to judge whether the test measures the right thing for our purpose. No test can be 100 percent valid.

Validity is the extent to which a test measures what it claims to measure. It is vital for a test to be valid in order for the results to be accurately applied and interpreted.

Lee J. Cronbach: "Validity is the extent to which a test measures what it purports to measure."

Thorndike: "A measurement procedure is valid insofar as it correlates with some measurement of success in the job for which it is being used as a predictor."

Validity is the watchword, the foundation stone on which the entire superstructure of testing is based. A test designed to measure what a student has learnt in mathematics should measure his achievement in that course and nothing else. If the test is so constructed that an intelligent student can determine the correct answer without knowing the subject matter, the test measures general intelligence rather than achievement in mathematics. Moreover, the results of a mathematics test may have a high degree of validity for computational skills, a low degree of validity for mathematical reasoning, and no validity for showing the applications of mathematics in music. Validity is therefore specific; a test may be valid for one purpose but not valid for another.
Factors influencing the validity of a test:

1. Clear directions: When the directions clearly indicate to the pupil how to respond to the items, how to record the responses, etc., his answers will improve the validity of the test.

2. Language: If the vocabulary and the sentence structure used in the questions are not unnecessarily complicated, the test will be more valid. Otherwise, students might know the answer but fail to answer correctly simply because they do not understand the language of the question. For example, a test in science which uses difficult language becomes a test in reading comprehension and does not measure what it intends to measure.

3. Medium of expression: Take the case where English is the medium of instruction as well as examination; some students who know the subject matter very well fail in subjects like history or geography only because they fail to express the subject matter through English. A test will be more valid if its answers are demanded in a language suitable for the students as a medium of expression.

4. Difficulty level of items: Test items which are either too easy or too difficult will not provide discrimination among students. This works against the validity of the test.

5. Construction of test items: Poorly constructed test items adversely affect the validity of a test. Classroom tests are often so constructed that they measure primarily the knowledge objective. Important objectives like application, thinking, and skill are not covered in these tests, which consequently invalidates the results.

6. Time limit: If the time limit given in an achievement test is inadequate, the fast writer will gain an advantage over the slow writer. Instead of measuring achievement, the test will measure speed of writing. On the other hand, if ample time is allowed in a speed test, where time is the most important factor, it will invalidate the results. The time limit of a good test is specified in the light of its try-out and the process of standardization.

7. Extraneous factors: Extraneous factors have to be eliminated in order to ensure the validity of a test. In essay-type or short-answer tests, the examiner is greatly influenced by such factors as style of expression, method of organizing the subject matter, good handwriting, and coverage of vast material through brevity. In an objective-type test, the length or vagueness of instructions, the confusing or lengthy statement of an item, bad arrangement or format of the items, and the options of responses are some of the extraneous factors.
Kinds of Validity

1. Face Validity: Face validity means that the test looks valid to the examinees, the administrators of the test, and other observers. Although emphasis on face validity may not be justified, it cannot be called an unnecessary feature of a test. If the test content on the face of it appears irrelevant, inappropriate, or childish, the result will be poor cooperation on the part of the examinees regardless of the actual validity of the test.

2. Content Validity: Content validity represents the extent to which a test measures a representative sample of the subject matter. If the test measures the ability of the students in the content area taught by the teacher and tells us that the objectives laid down for that content area have been achieved, the test is said to have content validity. When students complain that the test items are "out of syllabus", they are in reality questioning the content validity of the test.

3. Construct Validity: Whenever we want a test in terms of some psychological or intellectual trait or quality, such as reasoning ability, adjustment, aspiration, or anxiety, we are concerned with construct validity. A construct represents a psychological trait which exists in every individual in some measure. For example, reasoning ability is a construct. A measure of reasoning ability implies that there is a quality called reasoning ability which accounts for performance on the test. Verifying such implications is the task of construct validity. Construct validity is the extent to which test performance can be interpreted in terms of certain psychological constructs.

There are three methods by which we can determine construct validity. Firstly, we can find it by working out the test's correlation with other tests measuring the same construct. We can establish the construct validity of our intelligence test by computing its correlation with the accepted tests by Terman, Jalota, or other authorities. Secondly, we can find the construct validity of a test by computing its correlation with the scores of known groups. When a highly intelligent group obtains high scores or a poorly intelligent group obtains low scores on our test, it possesses construct validity. Thirdly, construct validity is established when scores on tests like intelligence tests, reasoning ability tests, etc., show a regular increase with increasing age or after training and orientation. Sometimes it is established when, after a certain age, scores show resistance to change even after training and education.

4. Criterion-Related Validity: Criterion-related validity investigates the correspondence between the scores obtained from the newly developed test and the scores obtained from some independent outside criterion. Depending on the time of administration, two types exist:

Concurrent Validity: correlation of the test scores (new test) with a recognized measure taken at the same time.

Predictive Validity: comparison (correlation) of students' scores with a criterion taken at a later time.
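To make the distinction concrete, here is a minimal sketch of estimating concurrent validity (the scores and variable names are invented for illustration; it uses the Pearson correlation from Python's standard library, available in Python 3.10+):

    from statistics import correlation  # Pearson's r; Python 3.10+

    # Hypothetical scores for the same ten students, collected at the same time.
    new_test = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
    established_measure = [45, 52, 38, 60, 50, 40, 55, 48, 37, 58]

    # A high positive r suggests the new test measures much the same thing
    # as the recognized measure.
    print(round(correlation(new_test, established_measure), 2))

For predictive validity, the second list would instead hold a criterion collected later, such as end-of-course grades.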
7. Reliability

[Figure: two panels labelled "Reliable" and "Unreliable".]

A test is reliable to the extent that repeated measurements give consistent results for the individual. Reliability covers several aspects of the consistency of scores. It indicates the extent to which individual differences in scores are attributable to true differences in the characteristics being measured and the extent to which they are attributable to chance errors.

The measure of reliability characterizes the test when administered under standard conditions and given to a group similar to the normative sample. Since all types of reliability are concerned with the degree of consistency, all of them can be expressed in terms of a correlation coefficient. Adequacy and objectivity of a test represent two aspects of reliability.

Reliability is synonymous with consistency. It is the degree to which test scores for an individual test taker or group of test takers are consistent over repeated applications. No psychological test is completely consistent; however, a measurement that is unreliable is worthless.
Would you keep using these measurement tools? The consistency of test scores is critically important in determining whether a test can provide good measurement.
Factors influencing reliability:

1. Length of the test: The length of the test affects reliability directly. A test of 100 items is more reliable than one with 50 items. A longer test provides a more adequate sample of the behavior being measured, and the scores are apt to be less influenced by chance factors. Lengthening a test is limited by a number of practical considerations such as time, fatigue, boredom, and a limited stock of good items. Instead of lengthening the test four times, say, we can use four parallel forms of the test and then obtain the mean of the four scores of each candidate. The reliability of the test obtained from these mean scores will be almost the same as we would obtain by lengthening the test four times.

2. Variability of the group: If the range of the group is wide, the reliability coefficients obtained will also be high. A very restricted range provides a low coefficient. The reliability coefficient of a test administered to a sample of students from several grades is higher than that of a test given to the students of a single grade, because the former has wider variability. Even in less extreme cases, account must be taken of the variability of talent within the group. Reliabilities for age groups tend to be higher than for grade groups, because an age group usually contains a greater spread of talent than a single grade. A sample made up of children from a wide range of socio-economic levels tends to yield higher reliabilities than a very homogeneous one.

3. Ability level of subjects: Some tests have high reliability for older students and low reliability for younger ones, because older students have a better level of understanding.

4. Range of the measuring instrument: A yardstick may be reliable for differentiating lines varying in length from 40 cm to 50 cm, but the same yardstick may be unreliable if the lines differ only minutely. This is true of an evaluation tool also: a test which is reliable over a wide range may not be reliable over a narrow range.

5. Objectivity of scoring: Objective-type tests in general produce more reliable results than subjective or essay-type tests. But objectivity does not always stand for reliability; a test with high scoring objectivity may still be quite unreliable. The true-false test is highly objective, but it may be unreliable because of the element of chance or guessing in selecting the answer.

6. Scoring technique: There is a greater possibility of errors if scoring is done by hand. If tests are machine-scored, there will be fewer mistakes and reliability will be higher.

7. Difficulty of the test: Tests which are too easy or too difficult for the group tend to provide scores of low reliability, because in both cases the variability of the scores will be very low. The differences among individuals will be small, thus making the test unreliable.

8. Method of test construction: The nature and form of the test items, their difficulty, the sampling, and the nature of the standardization group influence test reliability. For example, an increase in the number of alternate-response items would increase reliability.

9. Testing conditions: Results obtained from administering the test in a classroom would not be the same as those obtained in a hall or open place. Slight changes in time limits, shifts in examinees' emotional attitude, the attitude of the examiner, the degree of motivation, cheating, etc., all influence the reliability of the test.

10. Errors in the individual: Individual differences and chance fluctuations also affect reliability. Some of the chance fluctuations affecting scores are momentary distractions, concern about family, a broken pencil or pen, or a sudden headache or stomach pain.

11. Instability of scores: In most testing situations, an individual's score is expected to reflect not only his present standing but also his standing for some time to come. Scores on intelligence tests, for example, are expected to remain relatively stable over long periods of time. If a child's IQ goes up or down by as much as 10 points over a period of two years, the test is not reliable for use.

12. Ambiguity: When questions are interpreted in different ways at different times by the same student, the test will not give consistent results.

13. Giving of options: Giving a choice of questions reduces the common base on which different individuals may be compared. Examinees may not attempt the same items on a second administration of the same test when it is presented in different ways, i.e., with a change in the order of items of the multiple-choice type. In the case of true-false items, moreover, many students suffer from a general tendency to answer "true" rather than "false".
There are four methods of evaluating the reliability of an instrument:

Split-Half Reliability: Determines how much error in a test score is due to poor test construction.
To calculate: Administer one test once, then calculate the reliability index with the Kuder-Richardson formula 20 (KR-20) or the Spearman-Brown formula.

Test-Retest Reliability: Determines how much error in a test score is due to problems with test administration (e.g., too much noise distracted the participant).
To calculate: Administer the same test to the same participants on two different occasions, then correlate the test scores of the two administrations.

Parallel-Forms Reliability: Determines how comparable two different versions of the same measure are.
To calculate: Administer the two tests to the same participants within a short period of time, then correlate the test scores of the two tests.

Inter-Rater Reliability: Determines how consistent two separate raters of the instrument are.
To calculate: Give the results from one test administration to two evaluators and correlate the two markings from the different raters.
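Each of the last three methods reduces to correlating two lists of scores. As a minimal sketch (the scores are invented; the Pearson formula is the same one worked through later in these slides), test-retest reliability could be computed like this:

    import math

    def pearson(x, y):
        """Pearson product-moment correlation between two score lists."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                        sum((b - my) ** 2 for b in y))
        return num / den

    # Hypothetical scores from the same five students on two occasions.
    first_administration = [20, 15, 18, 11, 16]
    second_administration = [19, 14, 18, 12, 15]
    print(round(pearson(first_administration, second_administration), 2))  # near 1 = consistent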
Split-Half Reliability
When you are validating a measure, you will most likely be
interested in evaluating the split-half reliability of your
instrument.
This method will tell you how consistently your measure assesses the
construct of interest.
If your measure assesses multiple constructs, split-half reliability will be
considerably lower. Therefore, separate the constructs that you are measuring
into different parts of the questionnaire and calculate the reliability separately
for each construct.
Likewise, if you get a low reliability coefficient, then your measure is probably
measuring more constructs than it is designed to measure. Revise your
measure to focus more directly on the construct of interest.
If you have dichotomous items (e.g., right-wrong answers) as you would
with multiple choice exams, the KR-20 formula is the best accepted
statistic.
If you have a Likert scale or other types of items, use the Spearman-Brown
formula.
Split-Half Reliability: KR-20

NOTE: Only use the KR-20 if each item has a right answer. Do NOT use it with a Likert scale.

Formula: rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

where:
rKR20 is the Kuder-Richardson formula 20 reliability coefficient
k is the total number of test items
Σ indicates to sum
p is the proportion of test takers who pass an item
q is the proportion of test takers who fail an item
σ² is the variance of the entire test
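A minimal Python sketch of this formula (the function name and the 0/1 matrix layout are my own; it uses the population variance for σ², which matches the worked example that follows):

    def kr20(item_matrix):
        """item_matrix[s][i] = 1 if student s answered item i correctly, else 0."""
        n_students = len(item_matrix)
        k = len(item_matrix[0])                     # number of items
        # Sum of p*q across items, where p = proportion passing, q = 1 - p.
        sum_pq = 0.0
        for i in range(k):
            p = sum(row[i] for row in item_matrix) / n_students
            sum_pq += p * (1 - p)
        # Population variance of the total scores.
        totals = [sum(row) for row in item_matrix]
        mean = sum(totals) / n_students
        variance = sum((t - mean) ** 2 for t in totals) / n_students
        return (k / (k - 1)) * (1 - sum_pq / variance)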
Split-Half Reliability: KR-20

I administered a 10-item arithmetic test to 15 children. To calculate the KR-20, I entered the data in an Excel spreadsheet. The first column lists each student. In the item columns, I marked a 1 if the student answered the item correctly and a 0 if the student answered incorrectly.

Student     Math problems: 1. 5+3, 2. 7+2, 3. 6+3, 4. 9+1, 5. 8+6, 6. 7+5, 7. 4+7, 8. 9+2, 9. 8+4, 10. 5+6
Sunday      1 1 1 1 1 1 1 1 1 1
Monday      1 0 0 1 0 0 1 1 0 1
Linda       1 0 1 0 0 1 1 1 1 0
Lois        1 0 1 1 1 0 0 1 0 0
Ayuba       0 0 0 0 0 1 1 0 1 1
Andrea      0 1 1 1 1 1 1 1 1 1
Thomas      0 1 1 1 1 1 1 1 1 1
Anna        0 0 1 1 0 1 1 0 1 0
Amos        0 1 1 1 1 1 1 1 1 1
Martha      0 0 1 1 0 1 0 1 1 1
Sabina      0 0 1 1 0 0 0 0 0 1
Augustine   1 1 0 0 0 1 0 0 1 1
Priscilla   1 1 1 1 1 1 1 1 1 1
Tunde       0 1 1 1 0 0 0 0 1 0
Daniel      0 1 1 1 1 1 1 1 1 1
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

The first value is k, the number of items. My test had 10 items, so k = 10.

Next we need to calculate p for each item, the proportion of the sample who answered each item correctly.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Item:                   1.   2.   3.   4.   5.   6.   7.   8.   9.   10.
Number of 1's           6    8    12   12   7    11   10   10   12   11
Proportion Passed (p)   0.40 0.53 0.80 0.80 0.47 0.73 0.67 0.67 0.80 0.73

To calculate the proportion of the sample who answered each item correctly, I first counted the number of 1's for each item; this gives the total number of students who answered the item correctly. Second, I divided that count by the number of students who took the test, 15 in this case.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Next we need to calculate q for each item, the proportion of the sample who answered each item incorrectly. Since every student either passed or failed each item, p + q = 1 for every item.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Item:                   1.   2.   3.   4.   5.   6.   7.   8.   9.   10.
Number of 1's           6    8    12   12   7    11   10   10   12   11
Proportion Passed (p)   0.40 0.53 0.80 0.80 0.47 0.73 0.67 0.67 0.80 0.73
Proportion Failed (q)   0.60 0.47 0.20 0.20 0.53 0.27 0.33 0.33 0.20 0.27

I calculated the proportion who failed by the formula 1 - p, i.e., 1 minus the proportion who passed the item. You will get the same answer if you count up the number of 0's for each item and then divide by 15.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Now that we have p and q for each item, the formula says that we need to multiply p by q for each item. Once we multiply p by q, we add up these values across all of the items (the symbol Σ means to sum across all values).
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Item:                   1.   2.   3.   4.   5.   6.   7.   8.   9.   10.
Proportion Passed (p)   0.40 0.53 0.80 0.80 0.47 0.73 0.67 0.67 0.80 0.73
Proportion Failed (q)   0.60 0.47 0.20 0.20 0.53 0.27 0.33 0.33 0.20 0.27
p × q                   0.24 0.25 0.16 0.16 0.25 0.20 0.22 0.22 0.16 0.20

In the p × q row, I multiplied p by q; for example, 0.40 × 0.60 = 0.24. Once we have p × q for every item, we sum these values:

Σpq = 0.24 + 0.25 + 0.16 + … + 0.20 = 2.05
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

Finally, we have to calculate σ², the variance of the total test scores.
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

For each student, I calculated the total exam score by counting the number of 1's in the student's row:

Student     Total Exam Score
Sunday      10
Monday      5
Linda       6
Lois        5
Ayuba       4
Andrea      9
Thomas      9
Anna        5
Amos        9
Martha      6
Sabina      3
Augustine   5
Priscilla   10
Tunde       4
Daniel      9

The variance of the Total Exam Score is the squared standard deviation. (Calculating the standard deviation was discussed in the Descriptive Research Study example, slide 34.) The standard deviation of the Total Exam Score is 2.36; taking 2.36 × 2.36 gives the variance:

σ² = 5.57
rKR20 = (k / (k - 1)) × (1 - Σpq / σ²)

k = 10
Σpq = 2.05
σ² = 5.57

Now that we know all of the values in the equation, we can calculate rKR20:

rKR20 = (10 / (10 - 1)) × (1 - 2.05 / 5.57)
rKR20 = 1.11 × 0.63
rKR20 = 0.70
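As a check (not part of the original slides), feeding the same 15 × 10 answer matrix into the kr20 sketch defined earlier reproduces this result:

    # The 15 students' answers from the table above (rows in the same order).
    # Assumes the kr20() function from the earlier sketch.
    answers = [
        [1,1,1,1,1,1,1,1,1,1],  # Sunday
        [1,0,0,1,0,0,1,1,0,1],  # Monday
        [1,0,1,0,0,1,1,1,1,0],  # Linda
        [1,0,1,1,1,0,0,1,0,0],  # Lois
        [0,0,0,0,0,1,1,0,1,1],  # Ayuba
        [0,1,1,1,1,1,1,1,1,1],  # Andrea
        [0,1,1,1,1,1,1,1,1,1],  # Thomas
        [0,0,1,1,0,1,1,0,1,0],  # Anna
        [0,1,1,1,1,1,1,1,1,1],  # Amos
        [0,0,1,1,0,1,0,1,1,1],  # Martha
        [0,0,1,1,0,0,0,0,0,1],  # Sabina
        [1,1,0,0,0,1,0,0,1,1],  # Augustine
        [1,1,1,1,1,1,1,1,1,1],  # Priscilla
        [0,1,1,1,0,0,0,0,1,0],  # Tunde
        [0,1,1,1,1,1,1,1,1,1],  # Daniel
    ]
    print(round(kr20(answers), 2))  # 0.70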
Split-Half Reliability: Likert Tests

If you administer a Likert scale or another measure that does not have just one correct answer, the preferable statistic for calculating split-half reliability is coefficient alpha (otherwise called Cronbach's alpha).

However, coefficient alpha is difficult to calculate by hand. If you have access to SPSS, use coefficient alpha to calculate the reliability. If you must calculate the reliability by hand, use the Spearman-Brown formula instead. Spearman-Brown is not as accurate, but it is much easier to calculate.
Coefficient Alpha formula:

α = (k / (k - 1)) × (1 - Σσi² / σ²)

where σi² is the variance of one test item; the other variables are identical to the KR-20 formula.

Spearman-Brown formula:

rSB = 2rhh / (1 + rhh)

where rhh is the Pearson correlation of the scores on the two halves of the test.
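A minimal Python sketch of both formulas (the function names are my own; population variances are assumed, with one row per respondent):

    def cronbach_alpha(rows):
        """rows[s][i] = respondent s's rating on item i (e.g., Likert codes)."""
        n, k = len(rows), len(rows[0])

        def pvar(values):  # population variance
            m = sum(values) / len(values)
            return sum((v - m) ** 2 for v in values) / len(values)

        item_vars = sum(pvar([row[i] for row in rows]) for i in range(k))
        total_var = pvar([sum(row) for row in rows])
        return (k / (k - 1)) * (1 - item_vars / total_var)

    def spearman_brown(r_hh):
        """Step the half-test correlation up to a full-test reliability."""
        return 2 * r_hh / (1 + r_hh)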
Split-Half Reliability: Spearman-Brown Formula
To demonstrate the Spearman-Brown formula, I used the PANAS questionnaire that was administered in the Descriptive Research Study.
See the PowerPoint for the Descriptive Research Study for more information on the
measure.
The PANAS measures two constructs via Likert Scale: Positive Affect and
Negative Affect.
When we calculate reliability, we have to calculate it for each separate construct
that we measure.
The purpose of reliability is to determine how much error is present in the test score. If we
included questions for multiple constructs together, the reliability formula would assume
that the difference in constructs is error, which would give us a very low reliability
estimate.
Therefore, I first had to separate the items on the questionnaire into essentially two
separate tests: one for positive affect and one for negative affect.
The following calculations will only focus on the reliability estimate for positive affect. We
would have to do the same process separately for negative affect.
rSB = 2rhh / (1 + rhh)

These are the 10 items that measured positive affect on the PANAS. The data for each participant is the code for what they selected for each item: 1 is slightly or not at all, 2 is a little, 3 is moderately, 4 is quite a bit, and 5 is extremely. Fourteen participants took the test.

        Questionnaire Item Number
S/No    1   3   5   9   10  12  14  16  17  19
1       5   4   4   5   1   4   4   5   3   4
2       5   3   3   4   5   2   3   4   5   4
3       5   3   3   4   5   2   3   4   5   4
4       5   3   3   4   3   4   3   5   4   3
5       5   3   3   4   3   4   3   5   4   3
6       5   3   5   3   1   3   2   3   3   1
7       5   4   4   4   3   3   3   4   4   3
8       5   3   5   4   3   4   4   4   5   3
9       5   5   3   4   2   5   3   5   5   4
10      5   4   3   4   3   4   4   4   4   4
11      4   4   4   3   3   4   4   4   5   4
12      4   3   3   3   1   3   3   5   4   3
13      5   5   3   3   2   3   4   4   3   3
14      3   2   2   3   1   4   3   4   3   4

The first step is to split the questions in half. The recommended procedure is to assign every other item to one half of the test. If you simply take the first half of the items, the participants may have become tired toward the end of the questionnaire, and the reliability estimate will be artificially lowered.
rSB = 2rhh / (1 + rhh)

The first-half total was calculated by adding up the scores for items 1, 5, 10, 14, and 17. The second-half total was calculated by adding up the scores for items 3, 9, 12, 16, and 19.

S/No    1st Half Total   2nd Half Total
1       17               22
2       21               17
3       21               17
4       18               19
5       18               19
6       16               13
7       19               18
8       22               18
9       18               23
10      19               20
11      20               19
12      15               17
13      17               18
14      12               17
Split-Half Reliability: Spearman-Brown Formula

Now that we have our two halves of the test, we have to calculate the Pearson product-moment correlation between them:

rxy = Σ(X - X̄)(Y - Ȳ) / √{[Σ(X - X̄)²] × [Σ(Y - Ȳ)²]}

In our case, X = one person's score on the first half of the items, X̄ = the mean score on the first half, Y = one person's score on the second half of the items, and Ȳ = the mean score on the second half.

1 Half 2 Half
S/No Total Total
1 17 22 The first half total is
2 21 17 X.
3 21 17
4 18 19
The second half total is
5 18 19
Y.
6 16 13
7 19 18
8 22 18
9 18 23
10 19 20
11 20 19
12 15 17 We first have to
13 17 18 calculate the mean for
14 12 17 both halves.
Mea
n 18.1 18.4

This is X. This is
Y
(X X) (Y Y)
rxy =[(X X) ] [(Y Y) ] 2 2

1 Half 2 Half
S/No Total Total X-X Y-Y To get X X, we take each
1 17 22 -1.1 3.6 second persons first half total
2 21 17 2.9 -1.4 minus the average, which is
18.1. For example,
3 21 17 2.9 -1.4
21 18.1 = 2.9.
4 18 19 -0.1 0.6
5 18 19 -0.1 0.6
6 16 13 -2.1 -5.4
7 19 18 0.9 -0.4
8 22 18 3.9 -0.4
9 18 23 -0.1 4.6
10 19 20 0.9 1.6 To get Y Y, we take each
11 20 19 1.9 0.6
second half total minus the
average, which is 18.4. For
12 15 17 -3.1 -1.4
example,
13 17 18 -1.1 -0.4 18 18.4 = -0.4
14 12 17 -6.1 -1.4

Mea
n 18.1 18.4
(X X) (Y Y)
rxy =[(X X) ] [(Y Y) ] 2 2

(X X) (Y Y) = 12.66

1 Half 2 Half
S/No Total Total X-X Y-Y (X - X)(Y - Y)
Next we multiply (X X) times (Y
1 17 22 -1.1 3.6 -3.96 Y).
2 21 17 2.9 -1.4 -4.06
3 21 17 2.9 -1.4 -4.06
4 18 19 -0.1 0.6 -0.06
5 18 19 -0.1 0.6 -0.06
6 16 13 -2.1 -5.4 11.34
7 19 18 0.9 -0.4 -0.36
8 22 18 3.9 -0.4 -1.56
9 18 23 -0.1 4.6 -0.46
10 19 20 0.9 1.6 1.44
11 20 19 1.9 0.6 1.14
After we have multiplied,
12 15 17 -3.1 -1.4 4.34
we sum up the products.
13 17 18 -1.1 -0.4 0.44
14 12 17 -6.1 -1.4 8.54

Mean 18.1 18.4 Sum 12.66


(X X) (Y Y)
rxy =[(X X) ] [(Y Y) ] 2 2

[(X X)2] [(Y Y)2] = 82.72


To calculate the
S/N 1 Half 2 Half denominator, we have
o Total Total X-X YY (X - X)2 (Y - Y)2 to square (X X) and (Y
1 17 22 -1.1 3.6 1.21 12.96 Y)
2 21 17 2.9 -1.4 8.41 1.96
3 21 17 2.9 -1.4 8.41 1.96
4 18 19 -0.1 0.6 0.01 0.36
5 18 19 -0.1 0.6 0.01 0.36
6 16 13 -2.1 -5.4 4.41 29.16
7 19 18 0.9 -0.4 0.81 0.16
8 22 18 3.9 -0.4 15.21 0.16
9 18 23 -0.1 4.6 0.01 21.16 Next we sum the squares across
10 19 20 0.9 1.6 0.81 2.56
the participants. Then we multiply
the sums.
11 20 19 1.9 0.6 3.61 0.36
90.94 * 75.24 = 6842.33.
12 15 17 -3.1 -1.4 9.61 1.96 Finally, 6842.33 = 82.72.
13 17 18 -1.1 -0.4 1.21 0.16
14 12 17 -6.1 -1.4 37.21 1.96

Sum 90.94 75.24


rxy = Σ(X - X̄)(Y - Ȳ) / √{[Σ(X - X̄)²] × [Σ(Y - Ȳ)²]}

Now that we have calculated the numerator and denominator, we can calculate rxy:

rxy = 12.66 / 82.72
rxy = 0.15
rSB = 2rhh / (1 + rhh)

Now that we have calculated the Pearson correlation between our two halves (rxy = 0.15), we substitute this value for rhh and calculate rSB:

rSB = (2 × 0.15) / (1 + 0.15)
rSB = 0.3 / 1.15
rSB = 0.26

The measure did not have good reliability in my sample!
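The same result can be reproduced with the pearson and spearman_brown sketches from earlier (again a check, not part of the original slides):

    # Assumes pearson() and spearman_brown() from the earlier sketches.
    first_half = [17, 21, 21, 18, 18, 16, 19, 22, 18, 19, 20, 15, 17, 12]
    second_half = [22, 17, 17, 19, 19, 13, 18, 18, 23, 20, 19, 17, 18, 17]

    r_hh = round(pearson(first_half, second_half), 2)  # 0.15, matching the slides
    print(round(spearman_brown(r_hh), 2))              # 0.26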
How High Should Reliability Be?

A highly reliable test is always preferable to a test with lower reliability.

.80 or greater: Excellent
.70 to .80: Very Good
.60 to .70: Satisfactory
below .60: Suspect

A reliability coefficient of .80 indicates that 20% of the variability in test scores is due to measurement error.
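That rubric translates directly into code; a trivial sketch (the cut-offs and labels are the ones above, the function name is my own):

    def interpret_reliability(r):
        """Map a reliability coefficient to the qualitative labels above."""
        if r >= 0.80:
            return "Excellent"
        if r >= 0.70:
            return "Very Good"
        if r >= 0.60:
            return "Satisfactory"
        return "Suspect"

    print(interpret_reliability(0.70))  # "Very Good"
    print(interpret_reliability(0.26))  # "Suspect" (the PANAS halves example)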
Conclusion:

The present-day theory of measurement and evaluation has enabled us to formulate a comprehensive list of the qualities of a good test. But nothing can be taken as final about these qualities. A keen examiner and an inquisitive evaluator will always find that there is need and scope for improving even the best among the available tests.
"True teachers are those who use themselves as bridges over which they invite their students to cross; then, having facilitated their crossing, joyfully collapse, encouraging them to create their own."

Nikos Kazantzakis
