
ITEM ANALYSIS

• Item analysis gives you a way to exercise additional quality control over your
test. Well-specified learning objectives and well-constructed items give you a
head start on the process, but item analysis can give you feedback on how
successful you were.
• Items can be analyzed:
a. Qualitatively – in terms of their content, form, and validity. This also includes
the procedures for writing effective items.
b. Quantitatively – in terms of their statistical properties. This includes the
measurement of item difficulty and item discrimination.
• A test can be improved through the selection, substitution, and revision of test
items.
• Item Difficulty – the percentage (or proportion) of persons who answered the
test item correctly. It is the relative frequency with which examinees choose
the correct response. It is commonly known as the p value, which ranges from
0.0 to 1.0.
• Example: A word that is correctly defined by 70% of the standardization
sample (p = .70) is regarded as easier than one that is correctly defined
by only 15% (p = .15).
• One of the basic rules in item difficulty is to arrange items in order of
difficulty so that test takers begin with relatively easy items and
proceed to items of increasing difficulty.
• This arrangement gives test takers confidence in approaching the test.
• It also avoids wasting test takers' time on items beyond their ability.
• A major reason for measuring item difficulty is to choose items of a suitable
difficulty level.
• If no one passes an item, it is considered excess baggage: it provides no
information about individual differences. The same is true if everyone passes.
Since such items do not affect the variability of test scores, they contribute
nothing to the reliability and validity of the test scores.
• The formula for item difficulty is:
Difficulty = (# of test takers who answered the item correctly / Total # tested) x 100
• Example: The first item was administered to 25 students; let us assume that
23 students answered the item correctly.
• Difficulty = 23 / 25 x 100 = 92%
• The second item was administered to the same group, and 14 students answered
the item correctly.
• Difficulty = 14 / 25 x 100 = 56%
• This means that item #1 is easier than item #2, since a higher percentage of
students answered item #1 correctly than item #2.
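To make the arithmetic concrete, here is a minimal Python sketch of the difficulty calculation (the function name and data layout are illustrative, not part of the original material):

```python
def item_difficulty(num_correct: int, num_tested: int) -> float:
    """Percentage of test takers who answered the item correctly."""
    return num_correct / num_tested * 100

# Worked examples from the text: 25 students took both items
print(item_difficulty(23, 25))  # 92.0 -> item 1 (easier)
print(item_difficulty(14, 25))  # 56.0 -> item 2 (harder)
```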
• For criterion-referenced tests (post-testing) – with their emphasis on mastery
testing, many items on an exam form will have p values of .9 or above.
• For norm-referenced tests (pre-testing) – these are designed to be harder overall
and to spread out the examinees' scores. Thus, many of the items on an NRT
will have difficulty indexes between .4 and .6.
• Distribution of Test Scores
• The difficulty of the test as a whole is, of course, directly dependent on
the difficulty of the test items that make it up.
• A distribution of test scores may be clearly skewed:
• A. Piling at the lower end of the scale
• This means the test lacks easy items and most items are difficult to answer.
• Test takers would typically obtain zero or near-zero scores on the test items.
• B. Piling at the upper end of the scale
• Most test takers obtain nearly perfect scores.
• It becomes impossible to measure individual differences.
• Most items are easy.

Based on the skewness, it is also possible to suggest that more easy or more
difficult items be added, or that items be eliminated or modified (as sketched below).
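As a rough illustration (a hedged sketch, not from the source; it assumes SciPy is available and that `scores` holds hypothetical total test scores), the skewness of the score distribution can be used to flag a test that is too hard or too easy overall:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical total scores on a 20-item test
scores = np.array([2, 3, 3, 4, 5, 5, 6, 7, 9, 15])

g1 = skew(scores)  # positive: piling at the lower end; negative: piling at the upper end
if g1 > 0.5:       # 0.5 is an arbitrary rule-of-thumb cutoff, not from the source
    print(f"Positively skewed ({g1:.2f}): most items are too difficult; consider adding easier items.")
elif g1 < -0.5:
    print(f"Negatively skewed ({g1:.2f}): most items are too easy; consider adding harder items.")
else:
    print(f"Roughly symmetric ({g1:.2f}): difficulty is reasonably balanced.")
```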
• Relating Item Difficulty to Testing Purposes
• Item difficulty is directly related to the purpose of testing.
Example:
If a test is meant to measure mastery of higher-level skills, then the test items
will be harder compared to other forms of test items.
• If the purpose of the test is to ascertain whether an individual has
adequately mastered the basic essential skills, or whether he or she has
acquired the prerequisite knowledge to advance to the next step in a
learning program, relatively easy items are appropriate.
• Item Discrimination or D – refers to the degree to which an item
differentiates correctly among examinees with respect to the behavior that
the test is designed to measure.
• Choice of Criterion
1. Criterion-related validation – comparing the test scores with a non-test
criterion.
• This basis has been followed especially in the development of certain
personality and interest tests.
• It is the basis for item inclusion in biographical inventories, which cover a
heterogeneous collection of background facts about the individual.
• Examples of Biographical Inventories
a. Empirically-keyed Biodata 
b. Rationally-keyed Biodata 
2. Domain-referenced – pertains to items that can be evaluated by comparing the item performance
of individuals who have had varying amounts of instruction in the relevant functions.
• Used to determine whether individuals have reached a specified level of mastery.
• For achievement tests – item discrimination is usually investigated against the total score (i.e., the
examinee's raw score versus the total number of items). Ex. An examinee answered 23 out of 30 items correctly.
• For aptitude tests – item discrimination is based on construct validation, with the total score as an
appropriate criterion for item selection (the items are evaluated based on their ability to measure what
the test purports to measure).
• Since item responses are generally recorded as right or wrong, the measurement of item
discrimination usually involves a dichotomous variable (the item) and a continuous variable (the
criterion).
• Examples of dichotomous variables:
Multiple-choice items, in which there are correct and incorrect responses measuring the cognitive ability
of the person.
Yes or No items.
• Continuous variable (the criterion, the variable being predicted) – something to be judged or decided
that should be measurable. In some instances the criterion also becomes dichotomous in nature.
• Two Discrimination Coefficients – these are used to determine the
relationship between an item and the criterion.
• Point-biserial correlation – the relationship between how well
students did on the item and their total exam score.
• Biserial correlation – assumes a continuous and normal distribution
of the traits underlying both the dichotomous item response and the
criterion variable.
• It is useful for examining the relative performance of different groups
or individuals at the same time.
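A minimal sketch of the point-biserial coefficient (assuming NumPy; the data are made up, not from the source): because the item is scored 0/1, the point-biserial is simply the Pearson correlation between the item scores and the total exam scores.

```python
import numpy as np

def point_biserial(item_scores, total_scores):
    """Correlation between a dichotomous item (0/1) and a continuous criterion."""
    return np.corrcoef(item_scores, total_scores)[0, 1]

# Hypothetical data: 8 examinees' scores on one item and their total exam scores
item = np.array([1, 1, 1, 0, 1, 0, 0, 0])
total = np.array([28, 25, 24, 20, 19, 15, 12, 10])
print(round(point_biserial(item, total), 2))  # positive value: the item discriminates in the right direction
```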
Procedures in Item Analysis
1. Administer the test and obtain the test scores.
2. Arrange the scores in descending order.
3. Separate the test papers into two subgroups.
4. Take the 27% of papers with the highest scores and the 27% with the
lowest scores.
5. Count the number of correct answers to each item in the highest group and
in the lowest group.
6. Count the examinees who did not respond to the item.
Item analysis is done by obtaining:
a. the difficulty value or p value
b. the discriminating power or D
Analysis of Difficulty Index:
p value = (UL + LL) / N
Where: UL = # of test takers who answered the item correctly in the upper group
LL = # of test takers who answered the item correctly in the lower group
N = total number of test takers in the upper and lower groups combined
• It is very common in the social sciences to use 27%, based on the 27% rule.
Wiersma and Jurs (1990) stated that "27% is used because it has shown that this value will
maximize differences in normal distributions while providing enough cases for analysis" (p.
145).
• Analysis of Discriminating Power:
D = (UL – LL) / (N / 2)
Where: UL = # of test takers who answered the item correctly in the upper group
LL = # of test takers who answered the item correctly in the lower group
N = total number of test takers in the upper and lower groups combined
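The whole procedure and both formulas can be sketched in a few lines of Python (an illustrative implementation with hypothetical names and data; it takes N to be the combined size of the upper and lower groups, consistent with the formulas above):

```python
import numpy as np

def upper_lower_analysis(responses, fraction=0.27):
    """responses: examinees x items matrix of 0/1 scores.
    Returns (p, D) for each item using the upper/lower 27% groups."""
    totals = responses.sum(axis=1)                # total score per examinee
    n_group = max(1, int(round(len(totals) * fraction)))
    order = np.argsort(totals)                    # examinees sorted by total score (ascending)
    lower = responses[order[:n_group]]            # bottom 27% of papers
    upper = responses[order[-n_group:]]           # top 27% of papers
    UL = upper.sum(axis=0)                        # correct answers per item in the upper group
    LL = lower.sum(axis=0)                        # correct answers per item in the lower group
    N = 2 * n_group                               # test takers in the two groups combined
    p = (UL + LL) / N                             # difficulty index
    D = (UL - LL) / (N / 2)                       # discrimination index
    return p, D

# Hypothetical example: 10 examinees, 3 items
resp = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 0],
                 [1, 0, 1], [0, 0, 0], [1, 1, 1], [1, 0, 0], [0, 1, 0]])
p, D = upper_lower_analysis(resp)
print(p.round(2), D.round(2))
```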
• Item quality can then be judged against the difficulty and discrimination
indices given by Hopkins and Antes.
RELIABILITY AND VALIDITY OF TEST ITEMS
• Reliability – the best single measure of test accuracy; it is the
extent to which test results are consistent, stable, and free of error
variance. It is the extent to which a test provides the same ranking of
examinees when it is re-administered.
• Reliability can be measured by Coefficient Alpha or KR – 20
• A reliable test may not be valid. Ex. A yardstick that is actually shorter than
36 inches will measure consistently, but inaccurately, resulting in invalid data.
• The reliability coefficient may depend on test length. Reliabilities as
low as .50 are satisfactory for short tests of 10 to 15 items, but tests
with more than 50 items should have reliabilities of .80 or higher.
• The true score and error model is generally assumed in calculating reliability: a true score
and random error are both reflected in every obtained test score.
• True scores are the dependable part of a person's obtained score, uninfluenced by
chance events or conditions.
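In symbols, the classical true-score model can be written as follows (a standard formulation added here for clarity, not quoted from the source):

```latex
X = T + E, \qquad \sigma^2_X = \sigma^2_T + \sigma^2_E, \qquad
r_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}
```

where X is the obtained score, T the true score, E the random error, and r_XX' the reliability coefficient, i.e. the proportion of observed-score variance that is true-score variance.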
• Low reliability results from chance differences and is affected by several factors:
1. Variation in examinees' responses due to physiological or psychological conditions
2. Too many very easy or very hard items
3. Poorly written or confusing items
4. Testing conditions such as temperature, noise, etc.
5. Items that do not reflect a unified body of content
6. Errors in recording
7. Test length
• Reliability coefficient – measures the amount of error associated
with exam scores.
• The higher the value, the more reliable the overall exam (the range is
from 0.0 to 1.0).
• Typically, the internal consistency reliability is measured. This
indicates how well the items are correlated with one another.
• High reliability indicates that the items are all measuring the same
thing, or general construct (e.g. knowledge of how to calculate
integrals for a Calculus course).
• With multiple-choice items that are scored correct/incorrect, the
Kuder-Richardson formula 20 (KR-20) is often used to calculate the
internal consistency reliability.
• KR-20 – lets you know which students mastered the subject
matter and which did not.
• The closer the KR-20 is to +1.0, the more reliable the exam is
considered, because its questions do a good job of consistently
discriminating between higher- and lower-performing students.
• Cronbach’s Alpha – used to assess the reliability or internal consistency
of a set of items. It provides information on the extent to which the
items consistently measure a concept. It measures the strength of that
consistency.
• It also alerts test developers when test items are redundant.
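A minimal sketch of both coefficients (assuming NumPy and an examinees-by-items score matrix; the function names and data layout are my own, not from the source):

```python
import numpy as np

def kr20(responses):
    """KR-20 for dichotomously scored items (examinees x items matrix of 0/1)."""
    k = responses.shape[1]
    p = responses.mean(axis=0)                       # proportion answering each item correctly
    q = 1 - p
    total_var = responses.sum(axis=1).var()          # variance of total scores (population form)
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

def cronbach_alpha(scores):
    """Cronbach's alpha for item scores that need not be dichotomous."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)           # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

For 0/1 data the two functions give the same value, since KR-20 is the special case of Cronbach's alpha for dichotomous items.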
Reliability Interpretation

.90 and above: Excellent reliability; at the level of the best standardized tests.

.80 – .90: Very good for a classroom test.

.70 – .80: Good for a classroom test; in the range of most. There are
probably a few items which could be improved.

.60 – .70: Somewhat low. This test needs to be supplemented by other
measures (e.g., more tests) to determine grades. There are
probably some items which could be improved.

.50 – .60: Suggests need for revision of the test, unless it is quite short (ten or
fewer items). The test definitely needs to be supplemented by
other measures (e.g., more tests) for grading.

.50 or below: Questionable reliability. This test should not contribute heavily to
the course grade, and it needs revision.
Validity – is the extent to which a test measures what it is supposed to
measure. It is the most critical dimension of test development.
Validity is the most important consideration in test evaluation. The concept
refers to the appropriateness, meaningfulness, and usefulness of the specific
inferences made from test scores. Test validation is a process of accumulating
evidence to support such inferences.
Types of Validity
1. Face Validity – estimates whether a test measures what it claims to
measure. This specifies whether the measurement procedure you use
appears to be a valid measure of a given construct / variable.
For example, we may choose to use questionnaire items, interview questions,
and so forth. These questionnaire items or interview questions are part of
the measurement procedure. This measurement procedure should provide an
accurate representation of the variable (or construct) it is measuring if it is to
be considered valid.
2. Content Validity – the degree to which a test matches a curriculum and accurately
measures the specific training objectives on which the program is based. Expert
judgment is required.
•  Is the extent to which the elements within a measurement procedure
are relevant and representative of the construct that they will be used to measure
(Haynes et al., 1995).
•  In other words, do the questions really assess the construct in question, or are the
responses by the person answering the questions influenced by other factors?
3. Criterion-related validity – measures how well a test compares with an external
criterion. It includes:
• 3.a. Predictive Validity – the correlation between a predictor and a criterion
obtained at a later time. It measures how well the assessment results can predict a
relationship between the construct being measured and future behavior (e.g.,
test score on a specific competence and caseworker performance of job-related
tasks).
• 3.b. Concurrent validity – the correlation between a predictor and a
criterion at the same point in time. It measures how well the results of one
assessment correlate with other assessments designed to measure the
same thing (e.g., performance on a cognitive test related to training and
scores on a Civil Service examination). A small computational sketch of these
correlations follows this list.
• 3.c. Construct validity is the extent to which a test measures a theoretical
construct (e.g., a researcher examines a personality test to determine if the
personality typologies account for actual results).
• Does the theoretical concept match up with a specific measurement item?
• Example: Bobo doll experiment. The Bobo Doll Experiment was performed
in 1961 by Albert Bandura, to try and add credence to his belief that all
human behavior was learned, through social imitation and copying, rather
than inherited through genetic factors. These concepts are abstract and
theoretical but have been observed in practice.
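As noted under criterion-related validity above, predictive and concurrent validity both reduce to a correlation between test scores and a criterion; here is a minimal sketch with made-up data (assuming NumPy; the variable names and values are hypothetical):

```python
import numpy as np

# Hypothetical data: competence test scores and later job-performance ratings
test_scores = np.array([55, 62, 70, 74, 80, 85, 90])
job_ratings = np.array([2.1, 2.8, 3.0, 3.4, 3.9, 4.2, 4.6])

predictive_validity = np.corrcoef(test_scores, job_ratings)[0, 1]
print(round(predictive_validity, 3))  # closer to 1.0 means stronger prediction of the criterion
```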
