PSYCHOLOGICAL ASSESSMENT
Conceptual Paradigm for Measurement and Evaluation
Measurement — samples of behavior:
- Mental Abilities
  - General Intelligence (g): IQ
  - Specific Intelligence (s): verbal and non-verbal IQ
  - Multiple Intelligence
- Personality
  - Traits
  - States
  - Types (e.g., MBTI)
  - Aptitude
  - Interest
  - Values
- Scales of Measurement (IRON): Interval, Ratio, Ordinal, Nominal
- Parametric: normal distribution of scores (Pearson's r)

Test: a single measure
Battery: a series of tests
Assessment: various techniques (DITO) — Documents, Interview, Test, Observation
Evaluation (RAP): Recommendation, Action Plan, Program

Psychopathology:
- Diagnosis: classification, severity, types
- Prognosis: predicting the development of the disorder
Psychological Tests

Objective Tests:
- Standardized test administration, scoring, and interpretation of test scores
- Limited number of responses: multiple choice; true or false
- Group tests
- Norms: norm-referenced test (NRT), criterion-referenced test (CRT)

Projective Tests (WIDU) — tap Wishes, Intrapsychic conflict (conflict between desires and morals), Desires, and Unconscious motives:
- Subjectivity in test interpretation / clinical judgment
- Self-administered / individual tests
- Unlimited number of responses
Norms — the basis against which we interpret the scores of test takers; they transform raw scores into a meaningful scale.
> NRT: e.g., age norms
> CRT: e.g., how would we know if a basketball player is skillful? A sharpshooter; there is a certain criterion to be met.
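As a minimal sketch of that norm-based transformation (the norm sample and numbers below are hypothetical), a raw score can be converted to a z-score and a percentile rank against a norm group:

```python
import numpy as np
from scipy import stats

# Hypothetical norm group: raw scores from a normative sample
norm_scores = np.array([12, 15, 18, 20, 22, 23, 25, 27, 30, 33])

def to_meaningful_scale(raw_score, norms):
    """Transform a raw score using norm-group statistics."""
    z = (raw_score - norms.mean()) / norms.std(ddof=1)      # z-score
    percentile = stats.percentileofscore(norms, raw_score)  # percentile rank
    return z, percentile

z, pct = to_meaningful_scale(26, norm_scores)
print(f"z = {z:.2f}, percentile rank = {pct:.0f}")
```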
Medium of Psychological Tests

Psychological Tests:
- Ability Tests
  - Intelligence Tests: verbal intelligence, non-verbal intelligence
  - Achievement Tests: various skills / competencies
- Personality Tests (objective): traits / domains or factors; usually no right or wrong answers
Documents
- records, protocols, collateral reports
- initial assessment > verification
- forms: written, verbal, visual

Uses of psychological tests:
- Screen applicants
- Self-understanding
- Classify people
- Counsel individuals
- Retain, dismiss, or promote employees
- Research for programs, test construction
- Evaluate performance for decision-making
- Examine and gauge abilities
- Need for diagnosis and intervention

Observation
- behavioral observation
- observation checklist
Content Validity
- the degree to which the test represents the essence, the topics, and the areas that the test is designed to measure (the appropriate domain)
- the primary concern of test developers, because it is the content of the items that reflects the "whatness" of the property intended to be measured
- Ex.: achievement, aptitude, personality tests
- table of specification (TOS), a.k.a. the blueprint (done during analysis)
- From the TOS, items are generated and then checked/validated by (at least 3) experts, a.k.a. raters
Constructs (example diagrams): convergent — Test X measuring Optimism vs. Test Y measuring Optimism; discriminant — Test X measuring Optimism vs. Test Y measuring Pessimism.
1. Convergent Validity — the measure correlates well with other tests believed to measure the same construct.
2. Divergent (Discriminant) Validity — demonstrates that a test measures something different from what other available tests measure.
- The test should have low correlations with measures of other constructs, providing evidence for what the test does not measure.
Scores on psychological tests are not, in fact, true or real scores; rather, those scores represent a combination of many factors:

Observed Test Score = True Score + Measurement Error
X = T + e

In theory, the reliability coefficient (rxx) gives us an index of the influence of true scores and error scores on any given test. It is the ratio of true score variance to the total variance of the test. In actuality, rxx is very similar to a correlation (r); the two identical subscripts tell us that this r represents an rxx.
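A minimal simulation sketch of this model (all numbers hypothetical): generate true scores and independent error, then check that the ratio of true-score variance to total variance approximates rxx:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true_scores = rng.normal(loc=100, scale=15, size=n)  # T
error = rng.normal(loc=0, scale=5, size=n)           # e, independent of T
observed = true_scores + error                       # X = T + e

# Reliability as the ratio of true-score variance to total variance
r_xx = true_scores.var() / observed.var()
print(f"r_xx ≈ {r_xx:.3f}")  # ≈ 15² / (15² + 5²) = 0.9
```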
Models / Types of Reliability (the type to use depends on the test you are measuring)
1. Test-Retest Reliability — Pearson's r
- Give the same test to the same group of test takers on 2 different occasions.
- Scores on the 1st administration are compared to scores on the 2nd administration using r.
- Interval: 15 days to a month. Too early, familiarity (carryover effect); too long, maturation.
- Researchers often consider this a better measure of temporal stability (consistency of test scores over time).
- Assumption: people don't change between the 2 administrations.
- PROBLEM: practice or carryover effects (beneficial to the test takers).
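A short sketch of the computation (scores hypothetical) — test-retest reliability is simply Pearson's r between the two administrations:

```python
from scipy.stats import pearsonr

# Hypothetical scores for the same 8 test takers on two occasions
first_admin  = [23, 31, 28, 35, 40, 27, 33, 30]
second_admin = [25, 30, 29, 36, 38, 28, 34, 31]

r_xx, p_value = pearsonr(first_admin, second_admin)
print(f"test-retest reliability r = {r_xx:.2f}")
```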
2. Alternate-Forms Reliability — r
- To eliminate the practice effects and other problems with the test-retest method (i.e., reactivity), test developers often give 2 highly similar forms of the test to the same people at different times.
- Reliability, in this case, is again assessed across the different administrations.
- The goal is to develop an alternate form that is equivalent in content, response format, and statistical characteristics.
- PROBLEM: the difficulty of developing another (equivalent; same difficulty) form of the test.

Spearman-Brown Formula — estimates the reliability of a test whose length is changed by a factor k (e.g., projecting full-test reliability from a split half):

rxx = kr / (1 + (k − 1)r)

- rxx — the adjusted reliability coefficient
- r — the original reliability (correlation) coefficient
- k — the factor by which the length of the test is changed
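A small sketch of the formula as code (the function name is mine):

```python
def spearman_brown(r, k):
    """Predicted reliability when test length changes by factor k.

    r: reliability (correlation) of the original test or half-test
    k: factor by which test length is changed (e.g., k=2 to project
       full-test reliability from a split-half correlation)
    """
    return (k * r) / (1 + (k - 1) * r)

# Split-half correlation of 0.70, projected to the full-length test
print(f"{spearman_brown(0.70, 2):.3f}")  # 0.824
```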
Internal consistency coefficients:
1. KR-20 (Kuder & Richardson, 1937, 1939) — for tests whose questions can be scored either 0 or 1 (binary; dichotomous).
2. Coefficient alpha (Cronbach, 1951) — for rating scales that have 2 or more possible answers.
Problem: whether the test being split is homogeneous (i.e., measuring one characteristic) or heterogeneous (i.e., measuring many characteristics).
- In coefficient alpha, every item is compared to one another.
- Standard deviation (in terms of scores): high = heterogeneous (more spread); low = homogeneous (less spread).
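A compact sketch of both coefficients (response data hypothetical; the formulas are the standard KR-20 and Cronbach's alpha):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha: rows = respondents, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def kr20(items):
    """KR-20 for dichotomous (0/1) items; a special case of alpha."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)    # proportion answering 1 on each item
    pq = (p * (1 - p)).sum()  # item variances (population form) for 0/1 items
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - pq / total_var)

responses = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1]]
print(f"KR-20 = {kr20(responses):.2f}, alpha = {cronbach_alpha(responses):.2f}")
```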
Trait error — sources of error that reside within the individual taking the test (e.g., being hungry).

Reliability = True Score / (True Score + Error Score)

Interrater reliability = Number of agreements / (Number of agreements + Number of disagreements)

The greater the error, the lower the reliability.
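A tiny sketch of that agreement ratio (the ratings below are hypothetical):

```python
def interrater_agreement(rater_a, rater_b):
    """Proportion of observations on which two raters agree."""
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)

rater_a = ["yes", "no", "yes", "yes", "no"]
rater_b = ["yes", "no", "no", "yes", "no"]
print(f"interrater reliability = {interrater_agreement(rater_a, rater_b):.2f}")  # 0.80
```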
Stability — the same results are obtained over repeated administrations of the instrument.
- test-retest reliability
- parallel, equivalent, or alternative forms

Homogeneity — internal consistency (unidimensionality)
- item-total correlations; split-half reliability; Kuder-Richardson coefficient; Cronbach's alpha
Item-total correlations — each item on an instrument is correlated with the total score; an item with a low correlation may be deleted. The highest and lowest correlations are usually reported.
- only important if homogeneity of items is desired
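A brief sketch of that procedure (Likert responses hypothetical):

```python
import numpy as np

responses = np.array([  # rows = respondents, columns = Likert items
    [4, 5, 2, 4],
    [2, 3, 3, 2],
    [5, 4, 1, 5],
    [3, 3, 4, 3],
])
total = responses.sum(axis=1)

# Correlate each item with the total score; a corrected version
# would exclude the item itself from the total.
for i in range(responses.shape[1]):
    r = np.corrcoef(responses[:, i], total)[0, 1]
    print(f"item {i + 1}: item-total r = {r:.2f}")  # low r: candidate for deletion
```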
Kuder-Richardson coefficient — when items have a dichotomous response, e.g., yes/no (binary)
Cronbach's alpha — for Likert-scale or linear graphic response formats
- compares the consistency of responses across all items on the scale (may need to be computed for each sample)
Equivalence — consistency of agreement of observers using the same measure, or among alternative forms of a tool
- parallel or alternate forms (described under stability)
- interrater reliability
TEST CONSTRUCTION (its rudiments and process)
Test Planning
Decision to develop a standard test: (1) no test exists for a particular purpose, or (2) the tests existing for a certain purpose are not adequate for one reason or another.
Wechsler's idea for the WAIS originated from the Army Alpha (for literate soldiers) and the Army Beta (for illiterate soldiers); that's why there are vocabulary and performance tests.
Subject Matter Experts — the test developer must seek the help of experts in evaluating the test items and even the identified constructs or components of the test.
Writing Items — depends on whether the scale is to assess an attitude, content knowledge, ability, or personality traits; stick to the pattern (e.g., don't shift from declarative to interrogative statements).
Guidelines
1. Deal only with one central thought; an item with more than one is called double-barreled.
Poor item:
Better item:
2. Be precise.
Poor item:
Better item:
3. Be brief.
4. Avoid awkward wording or dangling constructions.
Poor item: Being clear is the overall guiding principle in writing items.
Better item: The overall guiding principle in writing items is to be clear.
* Active voice is preferred over passive voice.
Item Analysis
- Classical Test Theory (CTT)
- Latent Trait Models
  - Item Response Theory (IRT): 1P, 2P, 3P, and 4P models
  - Rasch Models (similar to the 1P model)
Level of Difficulty — the proportion or percent of examinees who answered the item correctly. To determine the difficulty level, tally the number of examinees with the correct answer on the item and then apply the formula:

P = (Nu / N) × 100

where Nu is the number of examinees who answered the item correctly and N is the total number of examinees.

(Difficulty table, partially recovered: bands run from "10% and below" = very difficult, through difficult and moderate; each band is classified as acceptable or unacceptable.)
4. The top 27% of the examinees is considered the upper group, while the bottom 27% of the total examinees comprises the lower group.
5. Get the frequencies of examinees who answered the item correctly in both groups.
6. Determine the difficulty level and the discriminating power.

Discriminating Power — captures the difference between examinees who did well and those who did poorly on a particular item. To determine the discrimination level, perform the steps in the difficulty level, then take the difference between the 2 groups and divide the difference by half of the total examinees. (? not finished)
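A runnable sketch of those steps (data hypothetical; note the divisor convention varies — the notes say half the total examinees, while the classic upper-lower index divides by the size of one group, used here):

```python
import numpy as np

def item_analysis(item_correct, total_scores, group_frac=0.27):
    """Difficulty and upper-lower discrimination for one 0/1-scored item.

    item_correct: 1 if the examinee answered this item correctly, else 0
    total_scores: each examinee's total test score
    """
    item_correct = np.asarray(item_correct)
    order = np.argsort(total_scores)            # low -> high total score
    n_group = max(1, int(len(order) * group_frac))
    lower = item_correct[order[:n_group]]       # bottom 27%
    upper = item_correct[order[-n_group:]]      # top 27%

    difficulty = item_correct.mean() * 100      # P = (Nu / N) * 100
    discrimination = (upper.sum() - lower.sum()) / n_group
    return difficulty, discrimination

item   = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
totals = [88, 75, 40, 90, 35, 70, 85, 45, 95, 50]
P, D = item_analysis(item, totals)
print(f"difficulty P = {P:.0f}%, discrimination D = {D:.2f}")
```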
Discriminability — Item/Total Correlation
- every item is correlated with the total score
- the point-biserial method is best used

Point-Biserial Method
- for dichotomously scored items / items with a correct answer
- one dichotomous variable (correct/incorrect) correlated with one continuous variable (total score) is a point-biserial correlation
- correlate the proportion of people getting each item correct with the total test score
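A quick sketch using SciPy's point-biserial correlation (the data are hypothetical):

```python
from scipy.stats import pointbiserialr

# 1 = answered the item correctly, 0 = incorrectly
item = [1, 0, 1, 1, 0, 1, 0, 1]
total_score = [34, 21, 30, 36, 19, 28, 24, 33]

r_pb, p_value = pointbiserialr(item, total_score)
print(f"point-biserial r = {r_pb:.2f}")  # high r: the item discriminates well
```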
CTT vs. LTM:
- CTT — item statistics apply only to the students taking that particular test.
- LTM — item statistics can often be generalized to similar students taking a similar test.
Latent Trait Models (LTM) — developed in the 1940s but widely used only since the 1960s.
- practically infeasible to use without specialized software

Item Response Theory (IRT) — a family of latent trait models used to establish the psychometric properties of items and scales.
- sometimes referred to as modern psychometrics because it has largely replaced CTT
- can model whether one has guessed an item (via a guessing parameter)
3 Basic Components
1. Item Response Function (IRF) — a mathematical function that relates the latent trait (a person's level on a construct) to the probability of endorsing an item. A good item's IRF rises steadily with the trait level.
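As a minimal sketch of an IRF (the 3-parameter logistic form; parameter values are hypothetical):

```python
import numpy as np

def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """3-parameter logistic item response function.

    theta: latent trait level
    a: discrimination (slope), b: difficulty (location),
    c: guessing parameter (lower asymptote) -- models lucky guesses
    """
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

for theta in (-2, 0, 2):
    p = irf_3pl(theta, a=1.5, b=0.5, c=0.2)
    print(f"theta = {theta:+d}: P(endorse) = {p:.2f}")
```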