
Using and developing standardized tests
Daisy Powell

December 13, 2016, University of Reading, www.reading.ac.uk

Summary
•  Background to standardized testing
–  Psychometrics and the study of IQ
•  Principles of standardized test construction
•  Critical example: standardized tests of reading
–  Word recognition tests
–  Reading comprehension tests
•  Practical session
–  Issues of test construction in the Malaysian context
•  Summary: pros and cons of standardized testing

Background
•  Origins of standardized testing
–  Alfred Binet
•  Advent of compulsory primary education in France
•  1905: Need for reliable way to identify children with
special educational needs
–  Army “Mental Tests”
•  Routinely used by the time of the Second World War to assign US
servicemen to appropriate jobs
–  Now routinely used across range of settings:
•  Clinical
•  Educational
•  Academic research
•  Business – etc.

Test construction
•  Test purpose
–  Huge variety in scope of tests
•  Wechsler Adult Intelligence Scale
•  Test of Word Reading Efficiency
–  Will the test assess what it’s supposed to be assessing?
–  Important to carefully target age-range
•  Different versions for different ages?
–  CTOPP (4-6; 7-21)
–  YARC (early years; primary; secondary)

Nature of stimulus items
•  Do they test what they are supposed to?
•  What type of items?
–  Multiple choice vs. definitions
•  Relevance (cultural/linguistic)
•  Do tests assess what they set out to assess?
–  Nature of test itself can have an impact
–  In reading comprehension: word recognition and/or linguistic
comprehension (more on this later…)
•  Adequate range of difficulty?
–  Fundamental to ensure sensitivity of test
•  How many items? (Longer tests more reliable – but…)

Define test administration procedures


•  All participants must be given equal chance to respond
correctly
•  Clearly defined test administration instructions
•  Who can administer tests?
–  Teacher/SENCo/Researcher/Clinician
•  Need to consider
–  Should recordings be made of orally administered questions?
•  Group vs. individual administration
•  Computer vs. pencil and paper (or both alternatives)

Define scoring procedures

•  Always crucial
•  Sometimes even more crucial
•  Objectivity vs. subjectivity in scoring
–  E.g. Tests of expressive vocabulary
–  What is a good definition of the word “knife”?
•  What are the criteria for allocating marks to responses?
•  Issue of reliability
•  Essential to removing experimenter bias

Test properties
•  Piloting process to establish key test parameters
–  Do items cover an appropriate range of abilities?
–  Do they reliably assess what they are supposed to?

•  Reliability and validity need to be tested as part of the piloting procedure
–  Reliability – Are test scores stable, dependable and relatively
free from error?
–  Validity – Does the test measure what it is purported to
measure?

Reliability
•  Test-retest: The extent to which a test yields the same
score when given to a participant on two different
occasions

•  Alternate-forms: Two different forms of the same test are given on two
different occasions to determine the consistency of the scores

•  Split-half: Divide the test items into two halves; scores are compared to
determine test score consistency
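A minimal Python sketch of how the first and third of these indices might be computed, using invented scores for ten examinees (the numbers are purely illustrative):

import numpy as np

# hypothetical total scores for ten examinees tested on two occasions
time1 = np.array([12, 18, 25, 31, 22, 15, 28, 19, 24, 30])
time2 = np.array([14, 17, 27, 29, 23, 16, 27, 21, 25, 28])
r_test_retest = np.corrcoef(time1, time2)[0, 1]   # test-retest reliability

# split-half: correlate totals on odd- vs even-numbered items,
# then apply the Spearman-Brown correction for the halved test length
odd_items = np.array([6, 9, 13, 16, 11, 7, 14, 10, 12, 15])
even_items = np.array([6, 9, 12, 15, 11, 8, 14, 9, 12, 15])
r_halves = np.corrcoef(odd_items, even_items)[0, 1]
r_split_half = 2 * r_halves / (1 + r_halves)

print(round(r_test_retest, 2), round(r_split_half, 2))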

Standard Error of Measurement


•  An estimate of the amount of variation to be expected in test scores.
•  If the reliability correlations are poor, the standard
error of measurement will be large.
•  The larger the standard error of measurement, the less
reliable the test.
•  Standard error inflated by:
–  Small standardization sample
–  Too few items in test
–  Restricted range of scores in the standardization sample
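•  A worked illustration (hypothetical figures): SEM = SD × √(1 − r); with SD = 15 and reliability r = .91, SEM = 15 × √0.09 = 4.5, so an observed standard score of 100 carries a 68% confidence band of roughly 95.5 – 104.5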

Validity
•  Content validity: Test’s ability to sample the content
that is being measured
•  Concurrent validity: The relation between a test’s
score and other available measures of the same
construct
•  Predictive validity: The relationship between test’s
score and future performance
•  Construct validity: The extent to which there is
evidence that a test measures a particular construct
•  Ecological validity: real world relevance


Norm Referencing: principles


•  Final and perhaps most crucial attribute of (some)
standardized tests
•  Based on tenet that human characteristics and
attributes (e.g. height, IQ) are distributed normally, or
follow a bell-shaped curve
•  Normal curve describes average population scores,
and scores that are below/above average
•  Norm-referenced tests allow you to compare the score
of an individual or subgroup with those of same-aged
peers from the test’s standardization sample
•  To do this, need to convert raw score to derived score
The normal distribution
[figure: bell curve marking the average range and a lower-tail "disorder" region]

Derived scores
•  Conversion of raw to derived scores
•  Standard score = z score (mean = 0; sd = 1)
•  To calculate z:
–  z = (x – µ) / σ
–  For example, let’s say you have a test score of 190. The test
has a mean (µ) of 150 and a standard deviation (σ) of 25.
Assuming a normal distribution, your z score would be:
–  z = (190 – 150) / 25 = 1.6
–  You scored 1.6 SD above the mean
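–  To express the same result on other derived scales (see next slide), multiply z by the new SD and add the new mean: for example, a standard score of 100 + 15 × 1.6 = 124, or a scaled score of 10 + 3 × 1.6 ≈ 15 (illustrative figures only)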

Derived scores
•  Conversion of raw to derived scores
•  Facilitates test interpretation
•  Wide range of derived scores
–  Percentiles
–  Standard scores (mean = 100, sd = 15)
–  Scaled scores (mean = 10, SD = 3)
–  Stanines
–  Age equivalent
•  ALL EXPRESS POSITION RELATIVE TO THE MEAN AND SD OF THE STANDARDIZATION SAMPLE
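A minimal Python sketch of these conversions, assuming a hypothetical test whose standardization sample has a raw-score mean of 150 and SD of 25:

import math

def derived_scores(raw, norm_mean, norm_sd):
    """Convert a raw score to common derived scores (illustrative only)."""
    z = (raw - norm_mean) / norm_sd              # z score: mean 0, SD 1
    standard = 100 + 15 * z                      # standard score: mean 100, SD 15
    scaled = 10 + 3 * z                          # scaled score: mean 10, SD 3
    stanine = min(9, max(1, round(2 * z + 5)))   # stanines run from 1 to 9
    percentile = 50 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF as a percentile
    return {"z": z, "standard": standard, "scaled": scaled,
            "stanine": stanine, "percentile": percentile}

print(derived_scores(190, norm_mean=150, norm_sd=25))
# z = 1.6, standard = 124, scaled = 14.8, stanine = 8, percentile ≈ 94.5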

The normal distribution
[figure: bell curve marking the average range and a lower-tail "disorder" region]

Selecting an appropriate
standardization sample
•  One of the main challenges of developing new,
culturally sensitive tests
•  Linguistic background of standardization sample
–  Monolingual, bilingual, trilingual
–  Education in L1?
•  What issues to consider in tests of:
–  Oral language
–  Written language
•  Size of standardization sample
•  How fine grained are age bands?


Representativeness
•  How representative is the standardization sample of the population from
which it was drawn?
•  Consider the demographic characteristics of the
members of the standardization sample
–  age
–  sex
–  race/ethnicity
–  parent education level
–  geographic location (rural vs. urban)
•  Norm group should be proportioned across these variables in a way that
reflects their prevalence in the population from which the sample is drawn.
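One way to check this, sketched in Python with entirely hypothetical census and sample figures, is a goodness-of-fit test of sample counts against population proportions for each stratification variable:

import numpy as np
from scipy.stats import chisquare

# hypothetical census proportions for one variable (geographic location)
population = {"urban": 0.70, "rural": 0.30}
# hypothetical counts in a standardization sample of 500 children
sample_counts = {"urban": 310, "rural": 190}

observed = np.array([sample_counts[k] for k in population])
expected = sum(sample_counts.values()) * np.array(list(population.values()))

# does the sample depart reliably from the census proportions?
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")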

Nature of the standardization sample
•  Will the test be used to assess children with special
needs (e.g. autism; dyslexia)?
•  If so, then the standardization sample should include
children with this profile
•  “Norms, if used, should refer to clearly described
populations. These populations should include
individuals or groups to whom test users will ordinarily
wish to compare their own examinees” (AERA et al.,
1999, p. 55).


Size of standardization sample

•  The larger the norm group, the more representative the sample is likely to
be of the reference population
•  As a rule of thumb, at least 100 subjects per age or
grade level (Salvia & Ysseldyke, 1991).
•  Specify purpose of testing:
–  A test devised to assess a specific population (e.g. L1 Chinese children's
English (L2) vocabulary) may not be appropriate for a different population
(e.g. L1 Malay/Tamil children's English (L2) vocabulary)
–  (Be careful when adapting existing tests!)

Critical example: word reading
•  Diagnostic Test of Word Reading Processes
–  Reading accuracy
–  3 subtests: nonwords, regular words, exception words
–  UK; 6 – 12 years
•  Test of Word Reading Efficiency
–  Reading efficiency (no. of items read correctly in 45 secs)
–  2 subtests: words, nonwords
–  US; 6 – 25 years

Word reading
•  Different types of letterstring
–  Regular words vs irregular words vs nonwords
–  Mono-syllabic vs multisyllabic
–  High Frequency vs Low Frequency
–  All such factors will influence performance
•  Accuracy vs fluency
•  In spite of these issues, word reading is a relatively
clearly defined construct
•  -> Assessing word reading is reasonably straightforward
•  -> Assessments tend to be reliable and valid

Critical example: Reading Comprehension
•  Many different existing tests of reading comprehension
•  Reading comprehension is a much more complex
construct than word recognition
–  Simple View (Tunmer & Chapman, 2012)
•  Reading comprehension = language comprehension × word recognition
•  Different format of questions:
–  Cloze (sentence completion)
–  Multiple choice
–  Open question
•  Sentence vs. short text vs. long text comprehension
•  Do different tests assess the same thing?

Nation & Snowling (1997)


•  Comparison between two widely used UK tests of
reading comprehension
–  Neale Analysis of Reading Ability (NARA)
–  Suffolk Reading Scale (SRS)
•  Decoding ability predicted performance on both tests
•  BUT language comprehension only accounted for additional unique variance
in performance on the NARA – not the SRS
•  SRS uses cloze format
•  -> the cloze format means the SRS primarily assesses decoding, not
comprehension
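A minimal Python sketch of the kind of hierarchical regression behind this comparison, using synthetic data (the weights are invented for illustration and are not Nation & Snowling's estimates):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
decoding = rng.normal(size=n)
lang_comp = rng.normal(size=n)
# a NARA-like test drawing on both skills; an SRS-like (cloze) test driven mainly by decoding
nara_like = 0.6 * decoding + 0.5 * lang_comp + rng.normal(scale=0.6, size=n)
srs_like = 0.8 * decoding + 0.05 * lang_comp + rng.normal(scale=0.6, size=n)

def unique_variance(y, base, added):
    """Change in R-squared when 'added' enters the model after 'base'."""
    r2_base = sm.OLS(y, sm.add_constant(base)).fit().rsquared
    r2_full = sm.OLS(y, sm.add_constant(np.column_stack([base, added]))).fit().rsquared
    return r2_full - r2_base

print("Unique variance of language comprehension, NARA-like test:",
      round(unique_variance(nara_like, decoding, lang_comp), 3))
print("Unique variance of language comprehension, SRS-like test:",
      round(unique_variance(srs_like, decoding, lang_comp), 3))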

Keenan, Betjemann & Olson, 2008
•  Comparison between various tests
•  Does the format of a test determine what aspects of
reading it assesses?
•  Are there developmental changes in the extent to
which different tests tap different underlying
processes?

Keenan, Betjemann & Olson, 2008
[figures reproduced from Keenan et al., 2008]
Keenan, Betjemann & Olson (2008)
•  All about test format?
•  No – WJPC uses cloze format, but PIAT uses multiple
choice
•  Keenan et al. argue it is because the passages are short in these two tests
–  In the WJPC and PIAT, correct performance was often dependent on having
correctly decoded a single word
–  In a short passage, decoding problems are likely to have a catastrophic
effect, whereas in a longer passage, context may help compensate for
problems decoding individual words
•  Why do decoding and language comprehension
account for so little variance?
–  Betjemann (2008) GORT: real world knowledge sufficient to
answer questions…

Developmental differences?

Keenan, Betjemann & Olson, 2008
•  Summary
–  Tests of reading comprehension do not all test the same thing
–  Some primarily test decoding/word recognition
–  Some primarily test linguistic comprehension
–  Some test different things at different age/reading level
–  Some test prior real-world knowledge and depend on neither linguistic
comprehension nor decoding!

Practical session
•  Norm referencing tests
•  In groups, pick construct (e.g. vocabulary; single word
reading; reading comprehension; grammar;
phonological awareness)
–  Have a look at some of the standardized tests we’ve brought
along
–  What are key issues to consider when
•  a) developing English language test for use with Malaysian
children
•  b) creating Malay/Tamil/Mandarin versions of tests

Why are standard scores useful?
•  Establishing performance in relation to peers
•  Comparing performance longitudinally across age
groups
–  E.g. monitoring effectiveness of intervention
•  Comparing across tests
–  A standard score of 100 is always average irrespective
of test
–  Discrepancy-defined disorders, e.g. dyslexia
–  Note though that different standardization samples will usually have been
used across different tests
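A minimal Python sketch of the intervention-monitoring point, using an invented norms table: the raw score rises simply because the child is older, but the flat standard score shows no catch-up relative to peers:

# hypothetical age-band norms: age -> (raw-score mean, SD) in the standardization sample
NORMS = {7: (20, 5), 8: (26, 5)}

def standard_score(raw, age):
    mean, sd = NORMS[age]
    return 100 + 15 * (raw - mean) / sd

print(standard_score(15, age=7))   # 85.0 before intervention
print(standard_score(21, age=8))   # 85.0 a year later: raw gain, but no catch-up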

Pros and cons of standardized tests


-  Pros
-  Statistically rigorous: published reliability and validity data
-  Expertise and effort in piloting test items, ensuring that they
are sensitive to individual differences across age ranges
-  Fair: standardized administration and scoring procedures ensure equal
opportunity for success
-  Advantage of local over national norms? (Stewart & Kaminski,
2002)
-  Cons
-  Excessively reductionist
-  Only as good as their standardization sample
-  BUT - is it feasible to find a representative sample in contexts of
great heterogeneity in background experiences?
-  Cultural fairness (Flanagan & Ortiz, 2007)

Alternative forms of assessment

•  Various measures have been suggested to avoid disadvantaging culturally
and linguistically diverse (CLD) populations
•  Processing dependent measures
–  More emphasis on processing, less on prior knowledge, e.g.
–  Tests of memory
•  Digit span, reverse digit span, non-word repetition
–  Perceptual tasks
•  Inspection time task; competing stimuli tasks; auditory figure
ground task
–  > e.g. Campbell, Dollaghan, Needleman & Janosky (1997)
•  CLD children showed significantly lower performance on knowledge-based
tasks, but no difference on processing-dependent measures

Alternative forms of assessment


•  Dynamic Assessment
–  Based on Vygotsky's "Zone of Proximal Development"
–  Assess both the current level of functioning and the best way to foster
further development
•  “Test – teach – retest” procedure
–  E.g. Gutierrez-Clellen & Pena (2001)
»  2 bilingual Latin American children
»  1 responded well to mediated learning experience
–  Task/stimulus variability
»  Change context in which child solves task
»  More concrete, multisensory
–  Graduated prompting
»  Children are given prompts at varying levels of support
Reading
•  American Educational Research Association, American
Psychological Association, & the National Council on
Measurement in Education (1999). Standards for educational and
psychological testing. Washington, DC: American Educational
Research Association.
•  Flanagan, D. P., & Ortiz, S.O. (2007). Essentials of cross-battery
assessment (2nd ed.). New York: Wiley.
•  Keenan, J.M., Betjemann, R.S., & Olson, R.K. (2008). Reading
comprehension tests vary in the skills they assess: Differential
dependence on decoding and oral comprehension. Scientific Studies of
Reading, 12, 281-300.
•  Ornstein, A. C. (1993). Norm-referenced and criterion-referenced
tests: An overview. NASSP Bulletin, 77(555), 28–39.
•  Salvia, J., & Ysseldyke, J. E. (1991). Assessment (5th ed.).
Boston: Houghton Mifflin.

