Discussion topics
Define reliability. What does it encompass?
Look at the various approaches in the attached handout. Which ones are useful for assessing accuracy?
Which ones are useful for assessing stability?
How would you assess reliability for a test you are designing?
If you obtained a low reliability coefficient, what could you do to improve it?
SOURCES OF VARIATION REPRESENTED IN DIFFERENT PROCEDURES
FOR ESTIMATING RELIABILITY

Methods of estimating reliability:
1. Immediate retest with same test
2. Retest after interval with same test
3. Parallel test form without time interval
4. Parallel test form with time interval
5. Odd-even halves of single test
6. Kuder-Richardson single test analysis
7. Cronbach's alpha for single test analysis

Sources of variation compared across these methods:
- Variation caused by the measurement procedure
- Variation caused by respondents: day-to-day variability
- Variation in respondents' speed of work
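Methods 1-4 above estimate reliability as the correlation between two sets of scores (two administrations, or two parallel forms). As a minimal sketch, that coefficient is simply the Pearson r between the two score lists; the function name and the sample scores below are illustrative, not from the handout:

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two score lists, e.g. a test and its
    retest (methods 1-4: two administrations or parallel forms)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical scores for five respondents on two administrations:
test = [12, 15, 11, 18, 14]
retest = [13, 16, 10, 19, 15]
stability = pearson_r(test, retest)
```

A high value of `stability` would indicate that respondents kept their relative standing across the two administrations.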
FACTORS INFLUENCING TEST RELIABILITY
1. The greater the number of items, the more accurate the test. The respondent's mental set for
accuracy is also important for reliability; that is, variations in incentive or effort matter, as do
perseverations from previous mental or emotional experiences.
2. On the whole, the longer the test administration time, the greater the accuracy. Stability may decline
if tests are too long.
3. The narrower the range of difficulty of items, the greater the reliability. Items of moderate difficulty
are preferred over easy or hard items.
4. Interdependent items are those which require a correct answer on one item before it is possible to
obtain a correct answer on others. Such grouped items tend to reduce the reliability.
5. The more systematic or objective the scoring, the greater the reliability coefficient. Error due to
mis-scored items reduces accuracy.
6. The greater the probability of achieving success by chance (guessing), the lower the reliability.
7. The more homogeneous the material, the greater the reliability.
8. Reliability is affected by the extent to which individuals have similar characteristics. A restricted
range of characteristics in your sample can result in low reliability because there is little variance;
greater variance in the sample tends to increase reliability.
9. Trick questions lower the accuracy. Subtle factors leading to misinterpretation of the test item lead to
unreliability.
10. Speed of work on test influences accuracy. Some test-takers are set for speed and some are not.
Some test-takers distribute their time properly; some do not.
11. Distractions have some effect on accuracy, although those effects can be overrated. Accidents, like
breaking a pencil or finding a defective test blank, are incidental factors. The respondent's attention to
the task may be limited by illness, worry, or excitement. These can affect accuracy, although not
always to the extent that most people think.
12. Reliability generally decreases when there is intervening time between tests. Delayed posttests are
given for the purposes of establishing validity, not reliability.
13. Cheating may be a factor in lowering accuracy or stability.
14. Position of the individual on the learning curve for the tasks of the test may be important
(restriction of range).
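Factor 1 (test length) has a standard quantitative form: the Spearman-Brown prophecy formula predicts the reliability of a test lengthened by a factor k. A minimal sketch; the formula is standard, the example numbers are illustrative:

```python
def spearman_brown(r, k):
    """Predicted reliability of a test whose length is multiplied by k,
    given its current reliability r (Spearman-Brown prophecy formula)."""
    return k * r / (1 + (k - 1) * r)

# Doubling a test with reliability .60 is predicted to raise it to about .75:
predicted = spearman_brown(0.60, 2)
```

The same formula with k < 1 predicts the reliability loss from shortening a test.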
In Linden, K.W. (1985) Designing tools for assessing classroom achievement:
A handbook of materials and exercises for Education 524.
Reviewing Teacher-made Tests
From Mitchell, R.J. Measurement in the classroom. Dubuque, Iowa: Kendall/Hunt, 1972,
pp. 115-116.
The comments and suggestions offered in the preceding pages apply to the planning and construction
of the different types of test items. The following suggestions briefly present the basic principles
that apply to the development of classroom tests as a whole.
1. Item Format
A. The items in the tests are numbered consecutively.
B. Each item is complete on a page.
C. Reference material for an item appears on the same page.
D. The item responses are arranged to achieve both legibility and economy of space.
2. Scoring Arrangements
A. Consideration has been given to the practicability of a separate answer sheet.
B. Answers are to be indicated by symbols rather than underlining or copying.
C. Answer spaces are placed in a vertical column for easy scoring.
D. If answer spaces are placed at the right of the page, each answer space is clearly associated
with its corresponding item.
E. Answer symbols to be used by the students are free from possible ambiguity due to careless
penmanship or deliberate hedging.
F. Answer symbols to be used by the students are free from confusion with the substance or content
of the responses.
3. Distribution of Correct Responses
A. Correct answers are distributed so that the same answer does not appear in a long series of
consecutive questions.
B. Correct answers are distributed to avoid an excessive proportion of items in the test with the
same answer.
C. Patterning of answers in a fixed repeating sequence is avoided.
Values of r for Different Levels of Significance*

            Levels of significance (two-tailed)
  n       .05      .02      .01      .001
  5     .7545    .8329    .8745    .9507
 10     .5760    .6581    .7079    .8233
 11     .5529    .6339    .6835    .8010
 12     .5324    .6120    .6614    .7800
 13     .5139    .5923    .6411    .7603
 14     .4973    .5742    .6226    .7420
 15     .4821    .5577    .6055    .7246
 16     .4683    .5425    .5897    .7084
 17     .4555    .5285    .5751    .6932
 18     .4438    .5155    .5614    .6787
 19     .4329    .5034    .5487    .6652
 20     .4227    .4921    .5368    .6524
 25     .3809    .4451    .4869    .5974
 30     .3494    .4093    .4487    .5541
 35     .3246    .3810    .4182    .5189
 40     .3044    .3578    .3932    .4896
 45     .2875    .3384    .3721    .4648
 50     .2732    .3218    .3541    .4433
 60     .2500    .2948    .3248    .4078
 70     .2319    .2737    .3017    .3799
 80     .2172    .2565    .2830    .3568
 90     .2050    .2422    .2673    .3375
100     .1946    .2301    .2540    .3211

(n = degrees of freedom, i.e., number of score pairs minus 2)

*Reduced version of Table VI of R.A. Fisher and F. Yates: Statistical Tables for Biological,
Agricultural, and Medical Research, Oliver & Boyd Ltd., Edinburgh.
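The tabled values can be reproduced from the t distribution: for n degrees of freedom, the two-tailed critical r is t / sqrt(t² + n), where t is the corresponding critical t value. A sketch using SciPy (assumed to be available):

```python
from scipy import stats

def critical_r(df, alpha):
    """Two-tailed critical value of Pearson r for df degrees of freedom,
    derived from the t distribution: r = t / sqrt(t^2 + df)."""
    t = stats.t.ppf(1 - alpha / 2, df)
    return t / (t ** 2 + df) ** 0.5

# critical_r(10, .05) reproduces the tabled .5760 for n = 10
```

An observed r larger in absolute value than the tabled entry is significant at that level.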
Mean score:
A. X̄ = ΣX / N, where X is a raw score and N the number of scores
B. X̄ = AM + Σ(X − AM) / N (assumed-mean method)

z-score (z̄ = 0; s = 1):
z = 1 · (X − X̄) / s + 0 = (X − X̄) / s

T-score (X̄ = 50; s = 10):
T = 10 · (X − X̄) / s + 50

Kuder-Richardson Formula 21:
KR-21 = [k / (k − 1)] · [1 − X̄(k − X̄) / (k s²)]

where k is the number of items, X̄ the test mean, and s² the test variance.
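As a minimal sketch, KR-21 can be computed directly from the number of items, the test mean, and the test variance; the example numbers below are hypothetical:

```python
def kr21(k, mean, variance):
    """Kuder-Richardson formula 21: reliability estimate from k items,
    test mean, and test variance (assumes items of roughly equal
    difficulty)."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

# A 10-item test with mean 6 and variance 4:
r = kr21(10, 6, 4)  # about 0.44
```

Because it needs only summary statistics, KR-21 is a quick lower-bound check before running a full item analysis.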
INTERPRETATION OF CORRELATION COEFFICIENTS
1. When may we call a coefficient high or low?
Stable coefficients from .00 to .20 = negligible correlation
6. Is there a direct arithmetical relationship between the size of a correlation and its value? Is a
coefficient of .75 three times as good as .25?
NO! A more accurate statement can be made by looking at the squares of the correlation
coefficients. The square of .25 is .0625, while the square of .75 is .5625. On this basis, a
coefficient of .75 is nine times, not three times, better than a correlation of .25.
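The squared-coefficient reasoning above, as arithmetic:

```python
def variance_shared(r):
    """Coefficient of determination: the proportion of variance two
    measures share, given their correlation r."""
    return r * r

ratio = variance_shared(0.75) / variance_shared(0.25)  # 0.5625 / 0.0625 = 9.0
```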
9
Likert Scale Reliability Procedures
1. Run a factor analysis including all the scale items with a rotation that is either varimax or
oblimin (look up the difference).
2. Look at the first factor matrix obtained.
3. Delete all items that do not load at least .33 on Factor 1.
4. Re-run the analysis without those items.
5. Look at the rotated factor matrix.
6. Identify subscales (groups of items that load at least .33 on a given factor).
For each questionnaire item, look to see which factor has the highest loading.
Ambiguous items are those that load well on more than one factor
(they typically have about the same factor loading on each).
7. Check to see whether any items load negatively, and reverse the scoring of those items.
8. Run the Cronbach's alpha program with the item analysis option.
9. Interpret the item analysis (Would the reliability go up if a particular item was deleted?)
10. Create the scales (I like to average items to facilitate comparison of means across scales, but
that is not necessary.)
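The computation behind step 8 is itself simple. As a sketch, Cronbach's alpha from a respondents x items score matrix, using NumPy (assumed available; the data below are illustrative):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents x items matrix of scores:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Five respondents answering three Likert items (hypothetical data):
scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1], [3, 4, 3]]
alpha = cronbach_alpha(scores)
```

Dropping an item and recomputing alpha reproduces the "alpha if item deleted" figures referred to in step 9.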