PSYCHOLOGICAL ASSESSMENT
Conceptual Paradigm for Measurement and Evaluation
Samples of Behavior:
- Mental Abilities
- Personality

Measurement Scales (IRON):
- Interval
- Ratio
- Ordinal
- Nominal

Test: a single measure
Battery Tests: a series of tests
Assessment: various techniques (DITO):
- Documents
- Interview
- Test
- Observation

Evaluation (RAP):
- Recommendation
- Action Plan
- Program Development
Psychopathology:
- Diagnosis
  - classification
  - severity
- Prognosis
  - predicting the development of the disorder (d/o)

Mental Abilities:
- General Intelligence (g): IQ
- Specific Intelligence (s): non-verbal IQ
- Multiple Intelligence
- Aptitude
- Interest
- Values

Personality:
- Traits
- States
- Types (ex. MBTI)
Measurement (IRON)
Parametric: normal distribution of scores (Pearson's r)
Non-parametric: non-normal distribution of scores (Spearman's rho; chi-square for nominal data)

- Interval: temperature, time, IQ - has no absolute zero
- Ordinal: rank, positions, Likert scale, birth order
- Ratio: weight, height - has absolute zero
- Nominal: sex, civil status - classifying

*Has absolute zero: weight, because there can be no weight, i.e., a true 0 value.
*Has no absolute zero: temperature, because 0 does not mean "no temperature."

A distribution of scores is normal if the mean, median, and mode are all the same (measures of central tendency); otherwise the distribution is skewed.
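The "mean = median = mode" check above can be sketched in a few lines. A minimal Python illustration (the score lists are invented):

```python
# Comparing measures of central tendency for a roughly symmetric sample
# versus a positively skewed one (data invented for illustration).
from statistics import mean, median, mode

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # mean = median = mode = 3
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]     # long right tail

print(mean(symmetric), median(symmetric), mode(symmetric))  # all equal: normal-like
print(mean(skewed), median(skewed), mode(skewed))           # mean pulled above the median: skewed
```

When the mean is dragged away from the median and mode by extreme scores, the distribution is skewed and non-parametric statistics become the safer choice.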
Psychological Tests are:
- Objective
- Standardized
- Normed: norm-referenced test (NRT)
Psychological Tests
Ability Tests
- Intelligence Tests
  - Verbal Intelligence
  - Non-verbal Intelligence
- Achievement Tests: measure the extent of one's knowledge of various academic subjects

Personality Tests
- Traits / Domains or Factors (ex. Myers-Briggs test)
- Inventory: objective
  *Usually, no right or wrong answers
Documents
-records, protocols,
collateral reports
Purposes of psychological assessment:
Screen applicants
Self-understanding
Classify people
Counsel individuals
Retain, dismiss, or promote employees
Research for programs, test construction
Evaluate performance for decision-making
Examine and gauge abilities
Need for diagnosis and intervention
Observation
-behavioral observation
-observation checklist
Content Validity
- the degree to which the test represents the essence, the topics, and the areas that the test is designed to measure (the appropriate domain)
- the primary concern of test developers, because it is the content of the items that reflects the "whatness" of the property the test intends to measure
Ex. achievement, aptitude, personality tests
Table of specification (TOS), a.k.a. the blueprint (falls under analysis): the TOS generates items that are then checked/validated by experts (at least 3), a.k.a. raters.
[Example TOS grid for a Depression test: domains listed one per box, e.g. 1. Suicidal Ideation, Self-harm; domains 2-4 not legible in the notes.]
Construct Validity
- theoretical domains, factors / components (constructs)
1. Convergent validity: direct correlation between variables (X and Y); a measure that correlates well with other tests believed to measure the same construct. Ex. the Optimism scale of test X correlates highly with the Optimism scale of test Y.
2. Divergent validity (discriminant): demonstrates that a test measures something different from what other available tests measure. Ex. the Optimism scale of test X versus the Pessimism scale of test Y. A test should have low correlations here, providing evidence for what the test does not measure.
Criterion-related Validity is estimated by correlating subjects' scores on a test with their behavior on an independent, real-life criterion. If the criterion you assess and correlate is occurring now, you are assessing concurrent validity. If the criterion is to occur in the future, you are assessing predictive validity.
Construct Validity (a.k.a. "true" validity) is the extent to which there is evidence that a test measures a particular hypothetical construct. For example, are we really measuring intelligence with an IQ test, when there are so many competing theories about what intelligence actually is?
Coefficient value: an estimate.
Variability: margin of error (because we are human beings).
Unsystematic error can result from varied assessment implementation, e.g., scoring by raters.
RELIABILITY: the consistency of scores on any given test. It is the ratio of true-score variance to the total variance of the test.
In notation, rxx is very similar to a correlation (r); the two identical subscripts indicate that the same test is correlated with itself.
Models / Types of Reliability (the type depends on what you are going to measure)
1. Test-retest
2. Internal consistency
   - Spearman-Brown formula: rxx = kr / (1 + (k - 1)r), where k is the factor by which the test is lengthened (or shortened) and r is the reliability of the original test.
   - KR20 (Kuder & Richardson, 1937, 1939): a coefficient for tests whose questions can be scored either 0 or 1 (binary; dichotomous).
   - Coefficient alpha (Cronbach, 1951): for rating scales that have 2 or more possible answers.
   - Problem: whether the test being split is homogeneous (i.e., measuring one characteristic) or heterogeneous (i.e., measuring many characteristics).
   - In internal consistency, every item is compared to one another.
   - Split-half reliability is mostly similar to internal consistency: the two halves of the test are correlated.
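The Spearman-Brown formula and coefficient alpha above can be sketched directly. A minimal Python version (the item scores are invented; variances use the population form, one common convention):

```python
# Spearman-Brown prophecy and Cronbach's alpha, as defined in the notes above.
from statistics import pvariance

def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability when a test with reliability r is lengthened k times."""
    return (k * r) / (1 + (k - 1) * r)

def cronbach_alpha(items: list[list[float]]) -> float:
    """items[i][j] = score of person j on item i."""
    n = len(items)
    total = [sum(scores) for scores in zip(*items)]       # each person's total score
    item_var = sum(pvariance(scores) for scores in items) # sum of item variances
    return (n / (n - 1)) * (1 - item_var / pvariance(total))

print(spearman_brown(0.6, 2))  # doubling a test with r = .60 raises reliability to about .75
print(cronbach_alpha([[1, 2, 3, 4], [2, 3, 4, 5], [1, 3, 3, 5]]))
```

Note how lengthening a test (k > 1) raises the predicted reliability, which is why short forms of tests are usually less reliable than the full versions.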
3. Scorer reliability (inter-rater reliability): judgments or ratings made by different scorers are compared, often using correlation, to see how much they agree.
If tests are being used to make important final decisions about people, then the reliability of the test should be high (0.95).
Lower reliability levels may be acceptable when:
- Making preliminary decisions,
- Sorting people into groups,
- Conducting research, etc.
The goal is to increase the probability of getting the true score and to minimize the standard error of measurement.
A test score is composed of the observed score (actual score), the true score (a reflection of what you really know), and the error score (the difference between the true score and the actual score).

Reliability = True Score / (True Score + Error Score)

Interrater reliability = Number of agreements / (Number of agreements + Number of disagreements)

More error means lower reliability.
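The two ratios above are simple to compute. A worked Python example (all numbers invented):

```python
# Reliability as the ratio of true-score variance to total variance,
# and interrater reliability as a proportion of agreements (numbers invented).
true_var, error_var = 80.0, 20.0
reliability = true_var / (true_var + error_var)
print(reliability)  # 0.8, i.e. 80% of score variance reflects the true score

agreements, disagreements = 18, 2
interrater = agreements / (agreements + disagreements)
print(interrater)   # 0.9, i.e. the raters agreed on 18 of 20 judgments
```
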
Stability: the same results are obtained over repeated administrations of the instrument.
- Test-retest reliability
- parallel, equivalent, or alternative forms
Homogeneity: internal consistency (unidimensional)
- item-total correlations; split-half reliability; Kuder-Richardson coefficient; Cronbach's alpha
Item-total correlations: each item on an instrument is correlated with the total score; an item with a low correlation may be deleted. The highest and lowest correlations are usually reported.
- only important if homogeneity of items is desired
Kuder-Richardson coefficient: when items have a dichotomous response, e.g., yes/no (binary)
Cronbach's alpha: Likert scale or linear graphic response format
- compares the consistency of responses of all items on the scale (may need to be computed for
each sample)
Equivalence: consistency of agreement among observers using the same measure, or among alternative forms of a tool
- parallel or alternate forms (described under stability)
- interrater reliability
TEST CONSTRUCTION (has rudiments, process)
Test Planning
Decision to develop a Standard Test
(1) No test exists for a particular purpose, or (2) the tests existing for a certain purpose are not adequate for one reason or another.
Wechsler's idea for the WAIS originated from the Army Alpha (for literate soldiers) and Army Beta (for illiterate soldiers); that is why there are verbal and performance tests.
The Wechsler scales cover both fluid and crystallized intelligence, while the Culture Fair Intelligence Test looks into specific intelligence; the difference between the two lies in how they define intelligence.
Subject Matter Experts: the test developer must seek the help of experts in evaluating the test items and even the identified constructs or components of the test.
Writing Items: depends on whether the scale is to assess an attitude, content knowledge, ability, or personality traits; stick to one pattern (ex. don't shift from declarative to interrogative statements).
Guidelines
1. Deal only with
   Poor item:
   Better item:
2. Be precise
   Poor item:
   Better item:
3. Be brief
4. Avoid awkward wording or dangling constructs.
   Poor item: Being clear is the overall guiding principle in writing items.
   Better item: The overall guiding principle in writing items is to be clear.
   *Active voice is preferred over passive voice.
5. Avoid irrelevant information
6. Present items in positive language
   *If using "not" is unavoidable, italicize or CAPITALIZE it.
7. Avoid double negatives
8. Avoid terms like "all" and "none"
   Poor item: Which of the following never occurs?
   Better item: Which of the following is extremely unlikely to occur?
Affective Domain: deals with the values of a learner, including his interests, appreciations, and attitudes.
Psychomotor: readiness for a particular action that may be mental, physical, or emotional.
Item Analysis
- A way of measuring the quality of questions: seeing how appropriate they were for the respondents and how well they measured their ability / trait.
- A way of measuring items over and over again in different tests with prior knowledge of how they are going to perform, creating a population of questions with known properties (e.g., a test bank), containing at least 3 or 4 times more items [than the final test].
Difficulty Level:
- Moderate: acceptable
- Difficult / Very difficult (10% and below): unacceptable
Discriminating Power: determines the difference between examinees who have done well and those who did poorly on a particular item. To determine the discrimination level, perform the steps for the difficulty level, then determine the difference between the 2 groups and divide the difference by half of the total number of examinees. (Note: this part of the notes was not finished.)
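The steps above can be sketched as two small functions. A minimal Python illustration (function names and counts are invented; groups here are simple halves, as the notes describe):

```python
# Difficulty and discrimination indices, following the steps in the notes
# (all counts invented for illustration).
def difficulty_index(correct: int, total: int) -> float:
    """Proportion of examinees who answered the item correctly."""
    return correct / total

def discrimination_index(upper_correct: int, lower_correct: int, total: int) -> float:
    """Difference between the top and bottom groups, divided by half the examinees."""
    return (upper_correct - lower_correct) / (total / 2)

# 40 examinees, split into an upper and a lower half of 20 each:
print(difficulty_index(24, 40))          # 0.6: moderate difficulty
print(discrimination_index(18, 6, 40))   # 0.6: the item separates strong from weak examinees
```

A positive discrimination index means high scorers got the item right more often than low scorers, which is what a good item should do.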
Discriminability:
- Item/Total Correlation
- Point-Biserial Method

Latent Trait Models (LTM): developed in the 1940s but widely used in the 1960s
- practically unfeasible to use without specialized software
- (classical item statistics, by contrast, only apply to those students taking that test)

Item Response Theory (IRT): a family of latent trait models used to establish the psychometric properties of items and scales
- sometimes referred to as "modern psychometrics" because it has largely replaced classical test theory (CTT)
- can predict if one has guessed an item
3 Basic Components
1. Item Response Function (IRF): a mathematical function that relates the latent trait (ex. individual differences on a construct) to the probability of endorsing an item.
2. Item Information Function: an indication of item quality; a good item has the ability to differentiate among respondents.
3. Invariance: item characteristics do not depend on the particular sample of respondents.
Item Response Theory (IRT) describes the relationship between examinee trait level, item properties, and the probability of endorsing the item.
- can be converted into Item Characteristic Curves (ICC), graphical functions that represent the respondent's ability.
Item Parameters / Location: an item's location b is defined as the amount of the latent trait needed to have a 0.5 probability of endorsing the item.
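The definition of location b can be made concrete with a logistic item response function. A minimal Python sketch (assuming the common logistic form; the parameter values are invented):

```python
# Logistic item response function (2PL form): probability of endorsing an item
# at trait level theta, with location b and discrimination a (values invented).
import math

def irf(theta: float, b: float, a: float = 1.0) -> float:
    """P(endorse) at trait level theta for an item with location b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

print(irf(theta=1.2, b=1.2))  # 0.5: at theta == b, endorsement probability is exactly 0.5
print(irf(theta=2.5, b=1.2))  # higher trait level gives a higher probability
```

Plotting this function over a range of theta values produces the Item Characteristic Curve (ICC) described above, with b marking the point where the curve crosses 0.5.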