
Reliability

Discussion topics
Define reliability. What does it encompass?

Look at the various approaches in the attached handout. Which ones are useful for assessing accuracy?
Which ones are useful for assessing stability?

How would you assess reliability for a test you are designing?

What could happen to diminish the reliability of your measure?

If you obtained a low reliability coefficient, what could you do to improve it?

SOURCES OF VARIATION REPRESENTED IN DIFFERENT PROCEDURES
FOR ESTIMATING RELIABILITY

Methods of Estimating Reliability:
1. Immediate retest with same test
2. Retest after interval with same test
3. Parallel test form without time interval
4. Parallel test form with time interval
5. Odd-even halves of single test
6. Kuder-Richardson single-test analysis
7. Cronbach's alpha for single-test analysis

Sources of Variation (for each method, consider which of these it treats as error variance):
- Variation caused by the measurement procedure
- Day-to-day variability in respondents
- Variation in the items sampled from the content domain
- Variation in respondents' speed of work

Which measures are best for establishing stability?

Which measures are best for establishing accuracy (equivalence)?

If consistency = stability + accuracy, which measures are ideal for establishing the reliability of the
measure you designed last week?
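To make two of these procedures concrete, here is a minimal Python sketch using fabricated right/wrong item data (all numbers are hypothetical, including the simulated second sitting): method 2 is estimated by correlating total scores across two administrations, and method 5 by correlating odd and even half-scores and stepping the result up with the Spearman-Brown correction.

    import numpy as np

    # Hypothetical data: 30 respondents x 20 right/wrong items (1 = correct).
    rng = np.random.default_rng(0)
    items = (rng.random((30, 20)) > 0.4).astype(int)

    # Method 2 (retest after interval): correlate total scores from two sittings.
    # The second sitting is simulated here; with real data, use the actual retest scores.
    test = items.sum(axis=1)
    retest = test + rng.integers(-2, 3, size=30)  # fabricated day-to-day wobble
    r_retest = np.corrcoef(test, retest)[0, 1]

    # Method 5 (odd-even halves): correlate half-scores, then step up to full
    # length with the Spearman-Brown correction, r_full = 2r / (1 + r).
    odd = items[:, 0::2].sum(axis=1)
    even = items[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    r_split = 2 * r_half / (1 + r_half)

    print(f"test-retest r = {r_retest:.2f}, odd-even (corrected) r = {r_split:.2f}")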

FACTORS INFLUENCING TEST RELIABILITY
1. The greater the number of items, the more accurate the test (see the Spearman-Brown sketch
following this list). The respondent's mental set for accuracy is also important for reliability; that is,
variations in incentive and effort matter, as do perseverations from previous mental or emotional
experiences.
2. On the whole, the longer the test administration time, the greater the accuracy. Stability may decline
if tests are too long.
3. The narrower the range of difficulty of items, the greater the reliability. Items of moderate difficulty
are preferred over easy or hard items.
4. Interdependent items are those which require a correct answer on one item before it is possible to
obtain a correct answer on others. Such grouped items tend to reduce the reliability.
5. The more systematic or objective the scoring, the greater the reliability coefficient. Error due to
mis-scored items reduces accuracy.
6. The greater the probability of achieving success by chance (guessing), the lower the reliability.
7. The more homogeneous the material, the greater the reliability.
8. Reliability is affected by the extent to which individuals in the sample have similar characteristics.
A restricted range of the characteristic leaves little variance and therefore yields a low reliability
coefficient; greater variance in the sample increases the coefficient.
9. Trick questions lower the accuracy. Subtle factors leading to misinterpretation of the test item lead to
unreliability.
10. Speed of work on test influences accuracy. Some test-takers are set for speed and some are not.
Some test-takers distribute their time properly; some do not.
11. Distractions have some effect on accuracy, although those effects can be overrated. Accidents, like
breaking a pencil or finding a defective test blank, are incidental factors. The respondent's attention to
the task may be limited by illness, worry, or excitement. These can affect accuracy, although not
always to the extent that most people think.
12. Reliability generally decreases when there is intervening time between tests. Delayed posttests are
given for the purposes of establishing validity, not reliability.
13. Cheating may be a factor in lowering accuracy or stability.
14. Position of the individual on the learning curve for the tasks of the test may be important (restriction
of range).
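Factor 1 can be quantified with the Spearman-Brown prophecy formula, the general form of the correction used for split halves above: lengthening a test k times with comparable items is predicted to raise its reliability from r to kr / (1 + (k - 1)r). This is also one concrete answer to the discussion question about improving a low coefficient. A small Python sketch with illustrative numbers:

    def spearman_brown(r, k):
        # Predicted reliability when a test is lengthened k times with comparable items.
        return k * r / (1 + (k - 1) * r)

    # Doubling a test whose reliability is .60:
    print(round(spearman_brown(0.60, 2), 2))  # 0.75
    # Quadrupling it:
    print(round(spearman_brown(0.60, 4), 2))  # 0.86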

In Linden, K.W. (1985) Designing tools for assessing classroom achievement:
A handbook of materials and exercises for Education 524.
Reviewing Teacher-made Tests
From Mitchell, R.J. Measurement in the classroom. Dubuque, Iowa: Kendall/Hunt, 1972,
pp. 115-116.
The comments and suggestions which have been offered in the preceding pages are appropriate for the
planning and constructing of the different types of test items. The purpose of the following suggestions is
to present briefly the basic principles or ideas which apply to the development of classroom tests.
1. Item Format
A. The items in the tests are numbered consecutively.
B. Each item is complete on a page.
C. Reference material for an item appears on the same page.
D. The item responses are arranged to achieve both legibility and economy of space.
2. Scoring Arrangements
A. Consideration has been given to the practicability of a separate answer sheet.
B. Answers are to be indicated by symbols rather than underlining or copying.
C. Answer spaces are placed in a vertical column for easy scoring.
D. If answer spaces are placed at the right of the page, each answer space is clearly associated
with its corresponding item.
E. Answer symbols to be used by the students are free from possible ambiguity due to careless
penmanship or deliberate hedging.
F. Answer symbols to be used by the students are free from confusion with the substance or content
of the responses.
3. Distribution of Correct Responses
A. Correct answers are distributed so that the same answer does not appear in a long series of
consecutive questions.
B. Correct answers are distributed to avoid an excessive proportion of items in the test with the
same answer.
C. Patterning of answers in a fixed repeating sequence is avoided.

4. Grouping and Arrangement of Items


A. Items of the same type requiring the same directions are grouped together in the test.
B. Where juxtaposition of items of markedly dissimilar content is likely to cause confusion, items
are grouped by content within each item-type grouping.
C. Items are generally arranged from easy to more difficult within the test as a whole and within each
major sub-division of the test.
5. Designating Credit Allowances
A. Credits are indicated for the major sections of the tests.
B. The credit allowance for each item is clear to the student.
C. Where questions have subdivisions, especially in essay questions, credits are indicated for each of
the parts of the question.
6. Directions for Answering Questions
A. Simple, clear, and specific directions are given for each different item type in the test.
B. Directions are clearly set off from the rest of the test by appropriate spacing or type style.
C. Effective use is made of sample questions and answers to help clarify directions for unusual
item types.
7. Guessing
A. If deductions are to be made for wrong answers, pupils are instructed not to guess.
B. If no deductions are to be made for wrong answers, pupils are advised to answer every
question according to their best judgment.
8. Allowing Choice of Items
A. The degree of choice is sufficiently limited, and questions among which choice is allowed are
sufficiently similar in difficulty, to maintain reasonable comparability of pupils' scores.
B. Directions covering choice are prominent, clear, and explicit.
C. Choice is exercised within relatively small groups of items rather than among many items.
9. Printing and Duplicating
A. The test has been duplicated to provide individual student copies.
B. The test is free from annoying and confusing typographical errors.
C. Legibility of the test is satisfactory from the viewpoint of type size, adequacy of spacing, and
clarity of printing.
D. The length of line is neither too long nor too short for easy comprehension.

Values of r for Different Levels of Significance*

                       Levels of Significance
Sample Size (n)     .05      .02      .01      .001
       5          .7545    .8329    .8745    .9507
      10          .5760    .6581    .7079    .8233
      11          .5529    .6339    .6835    .8010
      12          .5324    .6120    .6614    .7800
      13          .5139    .5923    .6411    .7603
      14          .4973    .5742    .6226    .7420
      15          .4821    .5577    .6055    .7246
      16          .4683    .5425    .5897    .7084
      17          .4555    .5285    .5751    .6932
      18          .4438    .5155    .5614    .6787
      19          .4329    .5034    .5487    .6652
      20          .4227    .4921    .5368    .6524
      25          .3809    .4451    .4869    .5974
      30          .3494    .4093    .4487    .5541
      35          .3246    .3810    .4182    .5189
      40          .3044    .3578    .3932    .4896
      45          .2875    .3384    .3721    .4648
      50          .2732    .3218    .3541    .4433
      60          .2500    .2948    .3248    .4078
      70          .2319    .2737    .3017    .3799
      80          .2172    .2565    .2830    .3568
      90          .2050    .2422    .2673    .3375
     100          .1946    .2301    .2540    .3211
*Reduced version of Table VI of R.A. Fisher and F. Yates: Statistical Tables for Biological,
Agricultural, and Medical Research, Oliver & Boyd Ltd., Edinburgh.
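The tabled values can be reproduced from the t distribution, since a two-tailed critical t converts to a critical correlation by r = t / sqrt(t^2 + df); the entries match when the left-hand column is read as degrees of freedom (for a correlation, df = n - 2), which is how the Fisher and Yates table is indexed. A Python sketch assuming SciPy is available:

    import math
    from scipy import stats

    def critical_r(df_, alpha):
        # Two-tailed critical t converted to a critical correlation: r = t / sqrt(t^2 + df)
        t = stats.t.ppf(1 - alpha / 2, df_)
        return t / math.sqrt(t ** 2 + df_)

    for alpha in (0.05, 0.02, 0.01, 0.001):
        print(alpha, round(critical_r(10, alpha), 4))
    # -> 0.576, 0.6581, 0.7079, 0.8233, matching the row for 10 above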

Computational Formulas for Test Analysis


In Linden, K.W. (1985)
Designing tools for assessing classroom achievement:
A handbook of materials and exercises for Education 524.
Measures of Central Tendency
Modal Score (Mo) = most popular (most frequent) score
Median Score (Mdn) = score representing the performance of the middle person in the group; the
midpoint of the distribution.
Mean Score ($\bar{X}$):
A: $\bar{X} = \frac{\sum X}{N}$, where $\sum X$ = sum of raw scores (X) and N = number of cases.
B: $\bar{X} = AM + \frac{\sum fd}{N}$, where AM = assumed mean (zero point for the deviation method),
N = number of cases, and $\sum fd$ = sum of all deviation scores (f = frequency).


Measure of Variability
Range (R) = (High Score - Low Score) + 1
Quartile Deviation (QD) = (score at the 75th percentile rank - score at the 25th percentile rank) / 2
Standard Deviation (SD or s): $s = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$, where X = each raw score,
$\bar{X}$ = mean, and N = number of students.

Standard Scores
Basic z-score ($\bar{X} = 0$; $s = 1$): $z = \frac{X - \bar{X}}{s}$, where X = any raw score,
$\bar{X}$ = mean of scores, and s = standard deviation of scores.
T-score ($\bar{X} = 50$; $s = 10$): $T = 10\left(\frac{X - \bar{X}}{s}\right) + 50$
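A brief Python sketch of these formulas applied to hypothetical raw scores; note that the standard deviation divides by N, matching the formula above:

    import numpy as np
    from collections import Counter

    scores = np.array([12, 15, 15, 17, 18, 20, 22, 23, 25, 33])  # hypothetical raw scores

    mode = Counter(scores.tolist()).most_common(1)[0][0]  # most popular score
    median = np.median(scores)                            # midpoint of the distribution
    mean = scores.mean()                                  # sum of X over N

    sd = scores.std()          # population SD: divides by N, as in the formula above
    z = (scores - mean) / sd   # z-scores: mean 0, SD 1
    T = 10 * z + 50            # T-scores: mean 50, SD 10

    print(f"Mo={mode}, Mdn={median}, mean={mean}, SD={sd:.2f}")
    print("T-scores:", np.round(T, 1))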

Internal Consistency Reliability and Standard Error of Measurement

Kuder-Richardson formula 21:
$r_{KR21} = \frac{k}{k-1}\left(1 - \frac{\bar{X}(k - \bar{X})}{k\,s^2}\right)$,
where k = number of items, $\bar{X}$ = mean score, and $s^2$ = standard deviation squared (variance of scores).

Standard error of measurement:
$s_m = s\sqrt{1 - r}$, where r is reliability and s is the standard deviation of scores.
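A minimal Python sketch of KR-21 and the standard error of measurement, computed from hypothetical total scores on a 20-item right/wrong test (KR-21 assumes items of roughly similar difficulty):

    import numpy as np

    def kr21(scores, k):
        # r_KR21 = (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))
        mean, var = scores.mean(), scores.var()
        return (k / (k - 1)) * (1 - mean * (k - mean) / (k * var))

    def sem(s, r):
        # Standard error of measurement: s * sqrt(1 - r)
        return s * np.sqrt(1 - r)

    scores = np.array([14, 11, 17, 9, 15, 12, 16, 10, 13, 18])  # hypothetical totals, k = 20
    r = kr21(scores, k=20)
    print(f"KR-21 = {r:.2f}, SEM = {sem(scores.std(), r):.2f} points")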

In Linden, K.W. (1985)
Designing tools for assessing classroom achievement:
A handbook of materials and exercises for Education 524
INTERPRETATION OF CORRELATION COEFFICIENTS
1. When may we call a coefficient high or low?
Stable coefficients from .00 to .20 = negligible correlation
                      .20 to .40 = low degree of correlation
                      .40 to .60 = moderate degree of correlation
                      .60 to .80 = marked degree of correlation
                      .80 to 1.00 = high degree of correlation


2. How high must a correlation be to be regarded as satisfactory?
The function of a coefficient of correlation is to measure the degree of association between
two variables. In some situations a correlation of .00 might be satisfactory, and in others a
correlation of .90 might be regarded as unsatisfactory. The coefficient stands merely as a
statement of fact.
3. Does correlation imply a causal relationship between two traits studied?
NO!
4. Does the correlation coefficient indicate the percentage of agreement between the two traits?
Does a coefficient of .20 mean 20 percent agreement?
NO! However, from the coefficient we can obtain a statement regarding the degree of overlap
between the two variables. This is done by squaring the coefficient.
5. Does a knowledge of the coefficient of correlation between two traits enable us to predict one
from the other?
YES, but the relationship between the size of a correlation coefficient and its predictive value
is not directly proportional. The lower correlations are of almost no value in prediction; the
moderate ones only slightly better; and the marked coefficients are somewhat, but not very
much better. Only as we advance into the high correlation range do the predictive values rise
to usable levels. A statement of predictive efficiency can be found by the following formula:
$E = 100\,(1 - \sqrt{1 - r^2})$
For example, r = .60 gives $100(1 - \sqrt{1 - .36}) = 100(1 - .80) = 20$ percent predictive efficiency.

6. Is there a direct arithmetical relationship between the size of a correlation and its value? Is a
coefficient of .75 three times as good as .25?
NO! A more accurate statement can be made by looking at the squares of the correlation
coefficients. The square of .25 is .0625, while the square of .75 is .5625. On this basis, a
coefficient of .75 is nine times, not three times, better than a correlation of .25.
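Points 4 through 6 can be checked numerically: squaring the coefficient gives the percentage of overlap, and the formula above gives the predictive efficiency. A short Python sketch:

    import math

    def overlap_pct(r):
        # Percent of variance the two variables share: 100 * r^2
        return 100 * r ** 2

    def forecasting_efficiency(r):
        # Predictive efficiency: 100 * (1 - sqrt(1 - r^2))
        return 100 * (1 - math.sqrt(1 - r ** 2))

    for r in (0.25, 0.60, 0.75, 0.90):
        print(f"r = {r:.2f}: overlap {overlap_pct(r):4.1f}%, "
              f"predictive efficiency {forecasting_efficiency(r):4.1f}%")

Even r = .75 yields only about 34 percent forecasting efficiency, which is why only the high range has usable predictive value.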

Likert Scale Reliability Procedures
1. Run a factor analysis including all the scale items with a rotation that is either varimax or
oblimin (look up the difference).
2. Look at the first factor matrix obtained.
3. Delete all items that do not load at least .33 on Factor 1.
4. Re-run the analysis without those items.
5. Look at the rotated factor matrix.
6. Identify subscales (groups of items that load at least .33 on a given factor). For each
questionnaire item, look to see which factor has the highest loading. Ambiguous items are those
that load well on more than one factor (they typically have about the same loading on several
factors).
7. Check to see if any items load negatively, and reverse the scoring of those items.
8. Run the Cronbach's alpha program with the item analysis option.
9. Interpret the item analysis (would the reliability go up if a particular item were deleted?).
10. Create the scales (I like to average items to facilitate comparison of means across scales, but
that is not necessary.)
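A condensed Python sketch of steps 1 through 10, under stated assumptions: pandas and scikit-learn are available, the data frame holds one column per 1-to-5 Likert item, and scikit-learn's FactorAnalysis (which offers a varimax rotation) stands in for whatever factor-analysis program is actually used. All data and item names are fabricated.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import FactorAnalysis

    def cronbach_alpha(items: pd.DataFrame) -> float:
        # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)
        k = items.shape[1]
        return (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))

    # Hypothetical responses: 100 people x 8 Likert items driven by one latent trait.
    rng = np.random.default_rng(1)
    latent = rng.normal(size=(100, 1))
    raw = latent + 0.8 * rng.normal(size=(100, 8))
    df = pd.DataFrame(np.clip(np.round(raw + 3), 1, 5).astype(int),
                      columns=[f"item{i}" for i in range(1, 9)])

    # Steps 1-5: factor analysis with a varimax rotation; inspect the loading matrix.
    fa = FactorAnalysis(n_components=2, rotation="varimax").fit(df)
    loadings = pd.DataFrame(fa.components_.T, index=df.columns)  # rows = items

    # Steps 3 and 6: keep items loading at least .33 (in absolute value) on factor 1.
    kept = [item for item in df.columns if abs(loadings.loc[item, 0]) >= 0.33]

    # Step 7: reverse-score any negatively loading item (a 1-5 scale reverses as 6 - x).
    scale = df[kept].copy()
    for item in kept:
        if loadings.loc[item, 0] < 0:
            scale[item] = 6 - scale[item]

    # Steps 8-9: alpha with an "alpha if item deleted" item analysis.
    print(f"alpha = {cronbach_alpha(scale):.2f}")
    for item in scale.columns:
        print(f"  without {item}: {cronbach_alpha(scale.drop(columns=item)):.2f}")

    # Step 10: scale score as the average of the retained items.
    scale_score = scale.mean(axis=1)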
