Tests
Daisy Powell
Summary
• Background to standardized testing
– Psychometrics and the study of IQ
• Principles of standardized test construction
• Critical example: standardized tests of reading
– Word recognition tests
– Reading comprehension tests
• Practical session
– Issues of test construction in the Malaysian context
• Summary: pros and cons of standardized testing
Background
• Origins of standardized testing
– Alfred Binet
• Advent of compulsory primary education in France
• 1905: Need for reliable way to identify children with
special educational needs
– Army “Mental Tests”
• Routinely used by the Second World War to assign US
servicemen to appropriate jobs
– Now routinely used across range of settings:
• Clinical
• Educational
• Academic research
• Business, etc.
Test construction
• Test purpose
– Huge variety in scope of tests
• Wechsler Adult Intelligence Scale
• Test of Word Reading Efficiency
– Will the test assess what it’s supposed to be assessing?
– Important to carefully target age-range
• Different versions for different ages?
– CTOPP (4-6; 7-21)
– YARC (early years; primary; secondary)
Nature of stimulus items
• Do they test what they are supposed to?
• What type of items?
– Multiple choice vs. definitions
• Relevance (cultural/linguistic)
• Do tests assess what they set out to assess?
– Nature of test itself can have an impact
– In reading comprehension: word recognition and/or linguistic
comprehension (more on this later…)
• Adequate range of difficulty?
– Fundamental to ensure sensitivity of test
• How many items? (Longer tests more reliable – but…)
Define scoring procedures
• Always crucial
• Sometimes even more crucial
• Objectivity vs. subjectivity in scoring
– E.g. Tests of expressive vocabulary
– What is a good definition of the word “knife”?
• What are the criteria for allocating marks to responses?
• Issue of reliability
• Essential to removing experimenter bias
Test properties
• Piloting process to establish key test parameters
– Do items cover an appropriate range of abilities?
– Do they reliably assess what they are supposed to?
Reliability
• Test-retest: The extent to which a test yields the same
score when given to a participant on two different
occasions
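Test-retest reliability is commonly quantified as the Pearson correlation between the two administrations. A minimal sketch in Python, with invented scores for six participants:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two sets of paired scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sx * sy)

# Same test given to the same six participants on two occasions
time1 = [12, 15, 9, 20, 14, 17]
time2 = [13, 14, 10, 19, 15, 18]

# A value close to 1 indicates high test-retest reliability
print(round(pearson_r(time1, time2), 2))
```

In practice, a test-retest coefficient of around .80 or above is usually taken as acceptable for standardized tests.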
Validity
• Content validity: Test’s ability to sample the content
that is being measured
• Concurrent validity: The relation between a test’s
score and other available measures of the same
construct
• Predictive validity: The relationship between test’s
score and future performance
• Construct validity: The extent to which there is
evidence that a test measures a particular construct
• Ecological validity: real world relevance
The normal distribution
[Figure: normal distribution curve, labelling the average range and the tail associated with disorder]
Derived scores
• Conversion of raw to derived scores
• Standard score = z score (mean = 0; sd = 1)
• To calculate z:
– z = (x – µ) / σ
– For example, let’s say you have a test score of 190. The test
has a mean (µ) of 150 and a standard deviation (σ) of 25.
Assuming a normal distribution, your z score would be:
– z = (190 – 150) / 25 = 1.6
– You scored 1.6 SD above the mean
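This calculation can be sketched in Python, using the figures from the example above (raw score 190, mean 150, SD 25):

```python
def z_score(x, mean, sd):
    """Number of standard deviations a raw score x lies from the mean."""
    return (x - mean) / sd

# Raw score 190 on a test with mean 150 and SD 25
print(z_score(190, 150, 25))  # 1.6
```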
Derived scores
• Conversion of raw to derived scores
• Facilitates test interpretation
• Wide range of derived scores
– Percentiles
– Standard scores (mean = 100, sd = 15)
– Scaled scores (mean = 10, SD = 3)
– Stanines
– Age equivalent
• ALL DESCRIBE POSITION RELATIVE TO THE MEAN AND SD
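As a sketch (function names are hypothetical), a z score can be converted into the derived scores listed above; the percentile uses the normal cumulative distribution function, computed here via the error function:

```python
import math

def z_to_standard(z):
    """Standard score: mean 100, SD 15."""
    return 100 + 15 * z

def z_to_scaled(z):
    """Scaled score: mean 10, SD 3."""
    return 10 + 3 * z

def z_to_percentile(z):
    """Percentile rank under a normal distribution (normal CDF x 100)."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.0  # one SD above the mean
print(z_to_standard(z))           # 115.0
print(z_to_scaled(z))             # 13.0
print(round(z_to_percentile(z)))  # 84
```

Note how the same performance (1 SD above the mean) maps onto a standard score of 115, a scaled score of 13, and roughly the 84th percentile.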
Selecting an appropriate
standardization sample
• One of the main challenges of developing new,
culturally sensitive tests
• Linguistic background of standardization sample
– Monolingual, bilingual, trilingual
– Education in L1?
• What issues to consider in tests of:
– Oral language
– Written language
• Size of standardization sample
• How fine grained are age bands?
Representativeness
• How representative is the standardization sample of
the population from which it was drawn?
• Consider the demographic characteristics of the
members of the standardization sample
– age
– sex
– race/ethnicity
– parent education level
– geographic location (rural vs. urban)
• Norm group should be proportioned across these
variables in a way that reflects prevalence in the
population from which sample is drawn.
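A minimal sketch of such a proportionality check (all groups and figures are invented for illustration):

```python
# Compare each demographic group's share of the norm sample with its
# share of the target population.
population = {"urban": 0.75, "rural": 0.25}   # population proportions
sample_counts = {"urban": 160, "rural": 40}   # norm sample counts

total = sum(sample_counts.values())
for group, target in population.items():
    observed = sample_counts[group] / total
    flag = "ok" if abs(observed - target) < 0.05 else "check"
    print(f"{group}: sample {observed:.2f} vs population {target:.2f} ({flag})")
```

In this invented example, urban children make up 80% of the sample but only 75% of the population, so the norms would slightly over-represent urban performance.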
Nature of the standardization sample
• Will the test be used to assess children with special
needs (e.g. autism; dyslexia)?
• If so, then the standardization sample should include
children with this profile
• “Norms, if used, should refer to clearly described
populations. These populations should include
individuals or groups to whom test users will ordinarily
wish to compare their own examinees” (AERA et al.,
1999, p. 55).
Critical example: word reading
• Diagnostic Test of Word Reading Processes (UK; 6 – 12 years)
– Reading accuracy
– 3 subtests: nonwords, regular words, exception words
• Test of Word Reading Efficiency (US; 6 – 25 years)
– Reading efficiency (no. items read correctly in 45 secs)
– 2 subtests: words, nonwords
Word reading
• Different types of letterstring
– Regular words vs irregular words vs nonwords
– Mono-syllabic vs multisyllabic
– High Frequency vs Low Frequency
– All such factors will influence performance
• Accuracy vs fluency
• In spite of these issues, word reading is a relatively
clearly defined construct
• -> Assessing word reading is reasonably
straightforward
• -> Assessments tend to be reliable and valid
Critical example: Reading Comprehension
• Many different existing tests of reading comprehension
• Reading comprehension is a much more complex
construct than word recognition
– Simple View (Tunmer & Chapman, 2012)
• Language comprehension × word recognition
• Different format of questions:
– Cloze (sentence completion)
– Multiple choice
– Open question
• Sentence vs. short text vs. long text comprehension
• Do different tests assess the same thing?
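The multiplicative claim of the Simple View can be sketched in Python (scores are invented and scaled 0–1 purely for illustration):

```python
def reading_comprehension(decoding, language_comprehension):
    """Simple View of Reading: comprehension as the product of
    decoding skill and language comprehension (both scaled 0-1)."""
    return decoding * language_comprehension

# Both skills strong -> good comprehension
print(round(reading_comprehension(0.9, 0.8), 2))  # 0.72
# No decoding -> no reading comprehension, however good oral language is
print(round(reading_comprehension(0.0, 0.9), 2))  # 0.0
```

The product (rather than a sum) captures the key prediction: if either component is zero, reading comprehension is zero.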
Keenan, Betjemann & Olson, 2008
• Comparison between various tests
• Does the format of a test determine what aspects of
reading it assesses?
• Are there developmental changes in the extent to
which different tests tap different underlying
processes?
Keenan, Betjemann & Olson (2008)
• All about test format?
• No – WJPC uses cloze format, but PIAT uses multiple
choice
• Keenan et al. argue it is because the passages are
short in these two tests
– In WJPC and PIAT, correct performance was often dependent
on having correctly decoded one single word
– In a short passage, decoding problems likely to have a
catastrophic effect, whereas in longer passage, context may
help support problems decoding individual words
• Why do decoding and language comprehension
account for so little variance?
– Betjemann (2008) GORT: real world knowledge sufficient to
answer questions…
Developmental differences?
Keenan, Betjemann & Olson, 2008
• Summary
– Tests of reading comprehension do not all test the same thing
– Some primarily test decoding/word recognition
– Some primarily test linguistic comprehension
– Some test different things at different age/reading level
– Some test prior real world knowledge and depend on neither
linguistic comprehension nor decoding!
Practical session
• Norm referencing tests
• In groups, pick construct (e.g. vocabulary; single word
reading; reading comprehension; grammar;
phonological awareness)
– Have a look at some of the standardized tests we’ve brought
along
– What are key issues to consider when
• a) developing English language test for use with Malaysian
children
• b) creating Malay/Tamil/Mandarin versions of tests
Why are standard scores useful?
• Establishing performance in relation to peers
• Comparing performance longitudinally across age
groups
– E.g. monitoring effectiveness of intervention
• Comparing across tests
– A standard score of 100 is always average irrespective
of test
– Discrepancy-defined disorders, e.g. dyslexia
– Note though that different standardization samples will
usually have been used across different tests
Alternative forms of assessment
Reading
• American Educational Research Association, American
Psychological Association, & the National Council on
Measurement in Education (1999). Standards for educational and
psychological testing. Washington, DC: American Educational
Research Association.
• Flanagan, D. P., & Ortiz, S.O. (2007). Essentials of cross-battery
assessment (2nd ed.). New York: Wiley.
• Keenan, J.M., Betjemann, R.S., & Olson, R.K. (2008). Reading
comprehension tests vary in the skills they assess: Differential
dependence on decoding and oral comprehension. Scientific Studies
of Reading, 12, 281-300.
• Ornstein, A. C. (1993). Norm-referenced and criterion-referenced
tests: An overview. NASSP Bulletin, 77(555), 28–39.
• Salvia, J., & Ysseldyke, J. E. (1991). Assessment (5th ed.).
Boston: Houghton Mifflin.